Igor's Website - Articles - Text file compression with Huffman algorithm

Science, stories, art and music.

Science / Computer science articles.

Text file compression with Huffman algorithm

Purpose

Here I will show how to use Huffman coding to compress text files. This implementation is in C++.

Huffman coding is a type of coding that allows lossless compression of data. It is a type of statistical coding, where some message is analyzed and repetitions are found for various dictionary items. Here, we will make a compression/decompression program that will use Huffman algorithm to encode/decode data stored in a file.

Inspecting the text file

Before we can proceed to compression, we need to analyze the contents of the file, so that we can tell the Huffman coder how to encode the characters of the text. We need to find out how many times each of the characters appears in the file.

int inspectFile(string name)
{
	ifstream dat;
	dat.open(name, ios::in);
	if (!dat.is_open())
		return 0;
	while (!dat.eof())
		cs.addElement(dat.get());
	dat.close();
	cs.removeLast();
	return 1;
}

For this purpose, we will create the CountingSet class. This is, basically, a normal set that counts the insertions of the same element. If a new element is inserted into the set, it is assigned a unique position in the internal array of the set, otherwise, the counter of the corresponding element is incremented.

void CountingSet::addElement(char e)
{
	for (unsigned i=0; i<ems.size(); ++i)
		if (ems[i].c==e)
		{
			++ems[i].count;
			return;
		}
	Element temp;
	temp.c=e;
	temp.count=1;
	ems.push_back(temp);
}

This provides a way to count the number of occurrences for each character of the file in a single reading of the file. Percentages are trivially obtained from these values.

void CountingSet::getPercentages(vector& percs)
{
	int count=0;
	for (unsigned i=0; i<ems.size(); ++i)
	{
		count+=ems[i].count;
		percs.push_back(ems[i].count*100);
	}
	for (unsigned i=0; i<ems.size(); ++i)
		percs[i]/=count;
}

A function is provided for removing the last element, since upon reading the file, the last element inserted would be the EOF (end of file) character.

Huffman coding

I will explain simply what Huffman coding is. Details of it can be found on the Wikipedia page for Huffman coding.

If we have a file containing only text, we can find the percentage of appearances of each character in the file. This enables us to find out which characters are used more than the others are. This will enable us to compress this text. If we find that the character ‘a’ is used the most times in our file, and the character ‘x’ is used the least number of times, we can encode ‘a’ with less bits than ‘x’ to achieve compression. If ‘a’ is used often, encoding it with less bits than ASCII defined, we will be able to save space. For a large file, this would mean significant compression.

void HuffmanCoding::generateCodes()
{
	while (nodes.size()>1)
	{
		Node *temp = nodes[0];
		nodes[0] = new Node;
		nodes[0]->c=0;
		nodes[0]->probab=temp->probab+nodes[1]->probab;
		nodes[0]->left = new Node;
		*(nodes[0]->left)=*temp;
		nodes[0]->right = new Node;
		*(nodes[0]->right)=*nodes[1];
		delete temp;
		delete nodes[1];
		nodes.erase(nodes.begin()+1);
		sort(nodes.begin(), nodes.end(), comparer);
	}
	queue<Node*> Q;
	Q.push(nodes[0]);
	while (Q.size())
	{
		Node *temp = Q.front();
		Q.pop();
		if (temp->left==NULL && temp->right==NULL)
		{
			Element tmp;
			tmp.c=temp->c;
			tmp.code=temp->code;
			codes.push_back(tmp);
			continue;
		}
		if (temp->left!=NULL)
			(temp->left->code=temp->code).append("0"), Q.push(temp->left);
		if (temp->right!=NULL)
			(temp->right->code=temp->code).append("1"), Q.push(temp->right);
	}
}

Basically, what we are doing is encoding each character with a number of bits that corresponds to how many times that character appears in the file. In order for a text encoded in such a way to be readable, we need to provide a unique way to decode the characters (so that there are no ambiguities for decoding pairs of characters), and we need to provide a table to tell the decoder which character is encoded in which way.

The header of the HuffmanCoding class is given below.

class HuffmanCoding
{
private:
	struct Element
	{
		char c;
		string code;
	};

	struct Node
	{
		char c;
		string code;
		double probab;
		Node *left, *right;
	};

	struct comparison
	{
		bool operator() (Node*& i, Node*& j) 
		{ 
			return (i->probab<j->probab);
		}
	} comparer;

	vector<Element> codes;
	vector<Node*> nodes;
	Node *root;
	void generateCodes();
public:
	HuffmanCoding(vector<char>&, vector<double>&);
	~HuffmanCoding(void);
	void getCodes(vector<string>&, const vector<char>&);
};

Given a map of characters and their percentage of occurrence in the file, the getCodes() method should return the Huffman codes for these characters.

void HuffmanCoding::getCodes(vector<string>& cds, const vector<char>& chars)
{
	generateCodes();
	int size = chars.size();
	for (int i=0; i<size; ++i)
		for (int j=0; j<size; ++j)
			if (codes[j].c==chars[i])
			{
				cds.push_back(codes[j].code);
				continue;
			}
}

Compression

After we have obtained the Huffman codes for the characters in the file, we can proceed to the compression of the file.

cout<<"Enter archive name: ";
comprName=fileName;
while (comprName==fileName)
	cin>>comprName;
if (!inspectFile(fileName))
{
	cout<<"Error inspecting file!";
	cin.get();
	return 0;
}
cs.getPercentages(pers);
cs.getCharacters(chrs);
HuffmanCoding hCodes(chrs,pers);
hCodes.getCodes(codes, chrs);
createMap();
Compressor comp(map);
comp.compress(fileName, comprName);
cout<<"Compression done.";

Before we go into the core of the compression algorithm, let’s point out the structure of our compressed file. The file needs to have a header that tells the decompressor how to interpret the data found in the body of the file. This header should basically contain the map generated during the inspection of the source file (inspection of the file we are compressing). We will also need to reserve space for the padding of the last character.

void Compressor::createHead()
{
	head.push_back(0);
	head.push_back(map.size());
	for (unsigned i=0; i<map.size(); ++i)
		head.push_back(map[i].ch);
	for (unsigned i=0; i<map.size(); ++i)
		head.push_back(map[i].code.size());
	for (unsigned i=0; i<map.size(); ++i)
		bodyBuffer+=map[i].code;
}

The information in the header contains the length of the table (map) of character codes, the characters in their order in the table, and the codes in their order in the table. This way, it is simple to read the compressed data.

In order to encode data properly, we need to have a buffer, since some characters can be longer than a byte. In this program, we will use a 512-bit buffer, and we will use string to represent bits. When committing the data to a file, we will convert each 8 characters of the string to a byte in the file.

string Compressor::loadBuffer(ifstream& strm)
{
	if (bodyBuffer.size()<256)
	{
		char chr;
		while (bodyBuffer.size()<256)
		{
			chr = strm.get();
			if (strm.eof()) break;
			for (unsigned j=0; j<map.size(); ++j)
				if (map[j].ch==chr)
				{
					bodyBuffer+=map[j].code;
					break;
				}
		}
		if (strm.eof())	done=true;
		if (done)
		{
			clearBuffer();
			return bodyBuffer;
		}
	}
	string temp="";
	temp.reserve(256);
	for (int i=0; i<256; ++i)
		temp+=bodyBuffer[i];
	clearBuffer();
	return temp;
}

The file writing operation is a simple loop which ends when the buffer is empty and there is no more data to be added to it. We add data to the buffer in the following way: we take a character from the input file, find its code in the table and concatenate that code to the buffer. When we have 256 characters (256 bits) in the buffer, we commit them to the file.

void Compressor::createBody()
{
	ofstream file;
	ifstream dat;
	file.open(toFile, ios::out|ios::binary);
	dat.open(fromFile, ios::in|ios::binary);
	if (!file.is_open()) return;
	for (unsigned i=0; i<head.size(); ++i)
	{
		char buffer[1];
		buffer[0]=head[i];
		file.write(buffer,1);
	}
	while (!done)
	{
		string tmp = loadBuffer(dat);
		char buffer[256];
		int intBuffer[256];
		for (unsigned i=0; i<tmp.size()/8; ++i)
			intBuffer[i]=0;
		for (unsigned i=0; i<tmp.size(); i+=8)
			for (unsigned j=i; j<i+8; ++j)
			{
				if (tmp[j]=='1')
					intBuffer[i/8]+=(int)pow((double)2,(double)(7-j%8));
			}	
		for (unsigned i=0; i<tmp.size(); ++i)
			buffer[i]=intBuffer[i];
		file.write(buffer, tmp.size()/8);
		file.flush();
	}
	file.seekp(ios::beg);
	char buffer[1];
	buffer[0]=(char)endBlanks;
	file.write(buffer, 1);
	file.close();
	dat.close();
}

After the last character has been encoded, we return to the start of the file, and write the padding for the last character (the file length in bits must be a multiple of 8, so we have to tell the decompressor how many blank bits there are at the end of the file).

Decompression

The decompressor works in the opposite way. First, it reads the contents of the header, and builds a map of characters and their corresponding codes.

void Decompressor::extract()
{
	ifstream cFile;
	cFile.open(fromFile,ios::in | ios::binary);
	if (!cFile.is_open()) return;
	endBlanks=cFile.get();
	int mapSize;
	mapSize=cFile.get();
	vector<int> codeSizes;
	for (int i=0; i<mapSize; ++i)
	{
		Pair temp;
		temp.ch=cFile.get();
		chrMap.push_back(temp);
	}
	for (int i=0; i<mapSize; ++i)
		codeSizes.push_back(cFile.get());
	for (int i=0; i<mapSize; ++i)
		if (maxSize<codeSizes[i])
			maxSize=codeSizes[i];
	for (int i=0; i<mapSize; ++i)
		chrMap[i].code=getFromBuffer(cFile,codeSizes[i]);
	ofstream uFile;
	uFile.open(toFile,ios::out);
	while (!done)
		uFile<<getNext(cFile);
	cFile.close();
	uFile.close();
}

From there, the process is very similar to compression, however we read the contents of the file into a buffer, and when a match is found between the beginning of the buffer and a code in the buffer, the corresponding character is committed to the decompressed file.

We use the getFromBuffer() method to obtain the next character from the buffer.

string Decompressor::getFromBuffer(ifstream& str, int leng)
{
	while ((unsigned)leng>bodyBuffer.size())
	{
		unsigned char tmp = (unsigned char)str.get();
		bodyBuffer+=charToBin(tmp);
	}
	string ret="";
	ret.reserve(leng);
	for (int i=0; i<leng; ++i)
		ret+=bodyBuffer[i];
	bodyBuffer=bodyBuffer.substr(leng);
	return ret;
}

The getNext() method is called to obtain the next character to commit to the file until the end of the compressed file is reached.

char Decompressor::getNext(ifstream& str)
{
	while ((unsigned)maxSize>bodyBuffer.size() && !str.eof())
	{
		unsigned char tmp = (unsigned char)str.get();
		bodyBuffer+=charToBin(tmp);
	}
	if (str.eof())
	{
		reverse(bodyBuffer.begin(),bodyBuffer.end());
		bodyBuffer=bodyBuffer.substr(endBlanks);
		reverse(bodyBuffer.begin(), bodyBuffer.end());
		done=true;
	}
	string s = "";
	s.reserve(maxSize);
	for (int i=0; i<maxSize; ++i)
	{
		s+=bodyBuffer[i];
		for (unsigned j=0; j<chrMap.size(); ++j)
			if (s==chrMap[j].code)
			{
				bodyBuffer=bodyBuffer.substr(i+1);
				char chr=chrMap[j].ch;
				return chr;
			}
	}
	return -1;
}

In this method we will match the buffer content to the content of the map, to obtain the next character. If the next character is the EOF character, we remove the excess characters from the buffer.

Improving the algorithm

The above example can obviously be improved. One way this can be achieved is to make a modification to the counting of the characters. If the file contains text, this text can be searched for words. A modification of the Lempel-Ziv algorithm can be used to do this. A dictionary can be formed, and the words in the dictionary encoded using the Huffman algorithm.

The speed of the algorithm can be improved by using bitwise operations instead of storing the data in strings.

Source code

This source code is available under Attribution Assurance License.

The source code for the application is available here: compressionAlgorithm.zip

21.4.2013

Google Code Prettify

I use Google Prettify to format the source code in my articles. If the code is displaying in one line, you can try opening the page in a different browser.

Request software design

If you wish to have a specific application designed, contact me at software@igorsevo.com. If you want to know more about what I do, check out my home page and Science page.

Support this site

Suggest an article

You can suggest an article or ask a question on the Questions page.

Social