Jang Mini Project With Source Code

SYNOPSIS
Compression is required to reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and decompress the data.

Our mini-project deals with the compression of text files based on the language they are written in. In this case, we have used the English language exclusively, but the concept may easily be extended to other languages. ASCII encoding, which is the most commonly used format to represent textual content, is normally stored with one byte, i.e. 8 bits, per character. This type of encoding involves a large amount of redundancy, as each byte has to account for 256 possible values even though ordinary text uses far fewer.

We have created an original compressed file format that removes this redundancy to a great extent. The essence of our project lies in converting a normal ASCII file to this special format (a compression algorithm) and back to ASCII (a decompression algorithm). The basis of the improved format lies in the frequency with which certain words appear, or are likely to appear, in a practical text document written in the English language. Since this algorithm is meant to be of practical use, a meaningful file (say, a novel or an article) would undergo far greater compression than one without any meaning.

Our compression algorithm uses a dictionary to identify commonly used words. It also generates a dictionary from the input text file so that words that appear frequently within the document may be identified.

JANG – A Text File Compressor

CONTENTS

1. INTRODUCTION
2. SYSTEM STUDY
2.1 Theory of system analysis
2.2 Theory of compression
2.2.1 Data compression overview
2.2.2 Types of compression
2.3 Applications of compression
2.4 Existing algorithms
2.5 Limitations of the existing system
2.6 Proposed system
2.7 Degree of application of the proposed system
2.8 Proposed updates
2.9 Proposed implementation details
2.10 Feasibility Study
2.11 System Requirements
3. DESIGN AND TESTING
3.1 Requirement Analysis
3.2 Benchmarking
3.3 Architecture
3.4 High Level Design
3.5 Dataflow Charts
3.6 Detailed Design
3.7 Bit Patterns
3.8 Testing
3.8.1 Testing fundamentals
3.8.2 Testing the application
4. IMPLEMENTATION
4.1 Application files
4.2 Installation
4.3 Execution
5. CONCLUSION
6. REFERENCES
7. APPENDIX
Appendix A. Program flow chart
Appendix B. Source code and screen shots


1. INTRODUCTION

1.1 Organizational Profile

Federal Institute of Science and Technology (FISAT) is one of the front-runners among the new-generation self-financing engineering colleges in Kerala. The college is unique in being, perhaps, the only professional college in the country run by a trade union of bank professionals with the patronage of a bank. FISAT is established and managed by the Federal Bank Officers' Association Educational Society (FBOAES), an initiative of the Federal Bank Officers' Association. The college started functioning in 2002 on its campus at Hormis Nagar, Mookkannoor, near Angamaly, the birthplace of the late K. P. Hormis, founder of the Federal Bank.

The college is affiliated to the Mahatma Gandhi University, Kottayam, Kerala. At present the college offers the B.Tech degree in four branches of engineering, with a total intake of 360 students each year.

The college has well-qualified and experienced faculty and excellent facilities, including well-equipped laboratories, workshops, a computer centre and a library, for the smooth and efficient conduct of these programmes. Other facilities such as hostels for boys and girls, a store, a canteen and college buses are also available. A language lab, an Internet café and a fitness centre are some of the additional facilities provided for the students.

A group advisory system is implemented in the college for the benefit of the students. A staff member is in charge of each batch of 30 students. The system is intended to give advice and guidance to the students in all curricular and extracurricular matters. The students can meet the advisor and discuss their personal and academic problems.

The library is fully automated, with more than 7000 volumes of textbooks and reference books in over 1800 titles. The digital library collection includes more than 350 CD-ROMs and DVDs. Major technical journals and science magazines are subscribed to. A book bank scheme, in which members are issued one standard textbook in each subject for use throughout a semester, also operates in the library.

A well-oriented career guidance programme, with emphasis on good performance in competitive exams, group discussions and interviews, is operational. Effective placement programmes have been initiated, and good rapport and interaction with industry are maintained.

1.2 Students' Technical Associations

All the departments have very active department associations. The various associations are:

Computer Science and Engineering: THYRA
Electronics and Communication: ECHO
Electronics and Instrumentation: IDEA
Electrical and Electronics: ELECTRA

A student branch of IEEE is actively functioning in the college. This branch organizes various programmes aimed at sharpening the professional skills of the students. The students' council has representatives elected from all the classes. The council helps to develop leadership qualities in students and prepares them to take up challenges confidently in their professional careers.

A very active Parent Teacher Association, which aims to establish and promote an esprit de corps among the parents and teachers, is functioning in the college.

1.3 Introduction to JANG Text Compressor

JANG is a novel scheme to reduce the amount of space required to store large text documents. It does this by encoding the text document into an intuitively designed compressed file format. Because its compression and decompression algorithms are lossless, the software can retrieve the original text document exactly from the encoded file.

1.4 Objective of JANG Text Compressor

The basic objectives of the application are as follows:
• To compress text files using a lossless compression method, so that the original file can be retrieved exactly
• To provide maximum compression, so that the storage cost of a large file is kept to a minimum
• To provide the compression in the least possible time

1.5 Scope of the Application

• Any industry that has to store huge files for a long time
• Database backups
• Providers of cheap online storage
• Any application that has to reduce its storage cost


2. SYSTEM STUDY

2.1 Theory of System Analysis

Systems analysis is the interdisciplinary part of Science, dealing with analysis of sets of interacting entities, the systems, often prior to their automation as computer systems, and the interactions within those systems.

System analysis involves the identification of the objectives and the requirements, the evaluation of alternative solutions and the recommendation of a more feasible solution. In other words, system analysis is a systematic process of gathering, recording and interpreting facts. It also includes studying the problems encountered in the present system and introducing a new system into an organization.

System analysis itself breaks into two stages: preliminary and detailed. During the preliminary analysis, the analyst lists the objectives of the proposed system. These findings come together in the preliminary report.

Once the preliminary report is approved, the system analysis phase advances into the next stage, the detailed analysis. During the detailed analysis, required data and information are collected and a detailed study is made.

2.2 Theory of Compression

2.2.1 Data compression overview

Data Compression shrinks down a file so that it takes up less space. This is desirable for data storage and data communication. Storage space on disks is expensive so a file which occupies less disk space is "cheaper" than an uncompressed file. Smaller files are also desirable for data communication, because the smaller a file the faster it can be transferred. A compressed file appears to increase the speed of data transfer over an uncompressed file.

2.2.2 Types of data compression

There are two main types of data compression: lossy and lossless. Lossy data compression, or perceptual coding, is possible if some loss of fidelity is acceptable. After lossy data compression is applied to a message, the message can never be recovered exactly as it was before it was compressed. When the compressed message is decoded it does not give back the original message; data has been lost. Because lossy compression cannot be decoded to yield the exact original message, it is not a good method of compression for critical data, such as textual data.

Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. With lossless data compression the original message can be exactly decoded. Lossless data compression works by finding repeated patterns in a message and encoding those patterns in an efficient manner. For this reason, lossless data compression is also referred to as redundancy reduction. Because redundancy reduction depends on patterns in the message, it does not work well on random messages. Lossless data compression is ideal for text. Many lossless compression algorithms are based on the LZ compression methods developed by Lempel and Ziv.

The idea of data compression is deeply connected with statistical inference. The theory of lossless compression is based on information theory (which is closely related to algorithmic information theory), and that of lossy compression on rate-distortion theory; both fields were essentially created by Claude Shannon.

Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.

2.3 Applications of Compression

Run-length encoding is a very simple example of lossless data compression: large runs of consecutive identical data values are replaced by a simple code giving the data value and the length of the run. Lossless compression is often used to optimize disk space on office computers, or to make better use of the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text and executable programs, losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).

For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way trade-off between compression speed, compressed data size and quality loss.

Lossy image compression is used in digital cameras to increase storage capacity with minimal degradation of picture quality. Similarly, DVDs use the lossy MPEG-2 codec for video compression. In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony, for example, while audio compression is used for CD ripping and is decoded by audio players.
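As a concrete illustration of the run-length encoding mentioned at the start of this section, the following is a minimal sketch in C. The function name rle_encode and the textual "count followed by character" output format are assumptions made for this example only; they are not part of JANG, and the format is unambiguous only when the input itself contains no digits.

```c
#include <stddef.h>
#include <stdio.h>

/* Run-length encode the string `in` into `out` as <count><char> pairs,
 * e.g. "WWWWBBW" becomes "4W2B1W".  Hypothetical helper, not JANG code.
 * Returns the number of characters written (excluding the terminator),
 * or 0 if `out` is too small or the input is empty. */
size_t rle_encode(const char *in, char *out, size_t out_size)
{
    size_t written = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    while (*in != '\0') {
        char value = *in;
        size_t run = 0;

        /* Count how many times the current character repeats. */
        while (*in == value) {
            run++;
            in++;
        }

        /* Append "<run><value>" and check that it fitted. */
        size_t room = out_size - written;
        int n = snprintf(out + written, room, "%zu%c", run, value);
        if (n < 0 || (size_t)n >= room)
            return 0;
        written += (size_t)n;
    }
    return written;
}
```

Calling rle_encode("WWWWBBW", buf, sizeof buf) with a sufficiently large buffer leaves "4W2B1W" in buf. The gain depends entirely on how long the runs are: text with few repeated characters would grow rather than shrink under this scheme.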

2.4 Existing Algorithms

The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG and the document-compression standard DjVu.

There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for "general intelligence".

2.4.1 LZ77

LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that have already passed through both encoder and decoder. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream." The encoder and decoder must both keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. This is why the encoder can use a smaller sliding window than the decoder, but not vice versa.
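The meaning of a length-distance pair is easiest to see from the decoder's side. The sketch below is illustrative only: the struct lz77_token layout (a pair plus one trailing literal character, as in the classic LZ77 triple) and the function name lz77_decode are assumptions of this example, and the whole output buffer stands in for the sliding window.

```c
#include <stddef.h>

/* One decoded unit: copy `length` characters from `distance` characters
 * back in the output, then append `literal`.  Hypothetical layout. */
struct lz77_token {
    size_t distance;
    size_t length;
    char literal;
};

/* Expand `ntokens` tokens into `out` (NUL-terminated).
 * Returns the decoded length, or 0 on a malformed token or overflow. */
size_t lz77_decode(const struct lz77_token *tokens, size_t ntokens,
                   char *out, size_t out_size)
{
    size_t pos = 0;

    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < ntokens; i++) {
        const struct lz77_token *t = &tokens[i];

        /* A match must point at data that has already been produced. */
        if (t->distance > pos || (t->length > 0 && t->distance == 0))
            return 0;

        /* Copy one character at a time so that a match may overlap the
         * text it is producing (length greater than distance). */
        for (size_t k = 0; k < t->length; k++) {
            if (pos + 1 >= out_size)
                return 0;
            out[pos] = out[pos - t->distance];
            pos++;
        }

        if (pos + 1 >= out_size)
            return 0;
        out[pos++] = t->literal;
    }
    out[pos] = '\0';
    return pos;
}
```

For example, the four tokens (0,0,'a'), (0,0,'b'), (0,0,'c'), (3,6,'d') expand to "abcabcabcd": the last token copies six characters starting three positions back, re-reading characters it has only just written.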

2.4.2 LZ78

While the LZ77 algorithm works on past data, the LZ78 algorithm attempts to work on future data. It does this by scanning forward through the input buffer and matching it against a dictionary it maintains. It scans into the buffer until it cannot find a match in the dictionary. At this point it outputs the location of the matched word in the dictionary (if one is available), the match length, and the character that caused the match failure. The resulting word is then added to the dictionary.
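The LZ78 scheme described above can be sketched as follows. This is a simplified illustration, not production code: the names lz78_encode and struct lz78_pair, the fixed table sizes, and the linear dictionary search are all assumptions of the example, and the match length is left implicit in the dictionary index rather than emitted separately.

```c
#include <stddef.h>
#include <string.h>

#define LZ78_MAX_ENTRIES 256  /* illustrative dictionary capacity */
#define LZ78_MAX_PHRASE  32   /* illustrative phrase size, incl. NUL */

/* One output unit: index of the longest matching dictionary phrase
 * (0 means "no match") and the character that ended the match. */
struct lz78_pair {
    size_t index;
    char next;
};

/* Encode `in` into at most `max_pairs` pairs; returns pairs written. */
size_t lz78_encode(const char *in, struct lz78_pair *out, size_t max_pairs)
{
    char dict[LZ78_MAX_ENTRIES][LZ78_MAX_PHRASE];
    size_t ndict = 0, npairs = 0;
    size_t len = strlen(in), pos = 0;

    while (pos < len && npairs < max_pairs) {
        size_t best = 0, best_len = 0;

        /* Find the longest phrase matching the upcoming input while
         * still leaving one character to act as the mismatch. */
        for (size_t i = 0; i < ndict; i++) {
            size_t l = strlen(dict[i]);
            if (l > best_len && l < len - pos &&
                strncmp(dict[i], in + pos, l) == 0) {
                best = i + 1;
                best_len = l;
            }
        }

        out[npairs].index = best;
        out[npairs].next = in[pos + best_len];
        npairs++;

        /* Add the matched phrase plus the new character as a new entry. */
        if (ndict < LZ78_MAX_ENTRIES && best_len + 2 <= LZ78_MAX_PHRASE) {
            memcpy(dict[ndict], in + pos, best_len + 1);
            dict[ndict][best_len + 1] = '\0';
            ndict++;
        }
        pos += best_len + 1;
    }
    return npairs;
}
```

On the input "abababa" this produces four pairs: (0,'a'), (0,'b'), (1,'b') and (3,'a'). A practical implementation would replace the linear search with a trie or hash table.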

2.4.3 Huffman Encoding

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code that expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to do this in linear time if input probabilities are sorted.
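To make the construction concrete, the sketch below computes only the code length that Huffman's method assigns to each symbol, by repeatedly merging the two least frequent nodes. The function name huffman_code_lengths and the limit HUFF_MAX_SYMBOLS are assumptions of this example, and the simple quadratic search is used for clarity rather than the linear-time method mentioned above.

```c
#include <stddef.h>

#define HUFF_MAX_SYMBOLS 64  /* illustrative limit for this sketch */

/* Compute Huffman code lengths (in bits) for n symbols with the given
 * frequencies.  Returns 0 on success, -1 if n is out of range. */
int huffman_code_lengths(const unsigned long *freq, size_t n,
                         unsigned *lengths)
{
    unsigned long weight[2 * HUFF_MAX_SYMBOLS];
    int parent[2 * HUFF_MAX_SYMBOLS];
    int alive[2 * HUFF_MAX_SYMBOLS];
    size_t nodes = n;

    if (n == 0 || n > HUFF_MAX_SYMBOLS)
        return -1;

    for (size_t i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    /* n - 1 merges turn n leaves into a single tree. */
    for (size_t merges = 0; merges + 1 < n; merges++) {
        int a = -1, b = -1;  /* smallest and second-smallest live nodes */

        for (size_t i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = (int)i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = (int)i;
            }
        }

        /* Replace the two nodes with a parent of combined weight. */
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = (int)nodes;
        alive[a] = alive[b] = 0;
        nodes++;
    }

    /* A symbol's code length is its depth in the tree. */
    for (size_t i = 0; i < n; i++) {
        unsigned depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            depth++;
        lengths[i] = depth;
    }
    return 0;
}
```

For frequencies 45, 13, 12, 16, 9 and 5 the resulting lengths are 1, 3, 3, 3, 4 and 4 bits: the most frequent symbol gets the shortest code, which is exactly the property described above.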

2.4.4 Arithmetic Coding

Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code. Like Huffman coding, arithmetic coding is a form of variable-length entropy encoding that converts a string into another representation that represents frequently used characters using fewer bits and infrequently used characters using more bits, with the goal of using fewer bits in total. As opposed to other entropy encoding techniques that separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction n where (0.0 ≤ n < 1.0).
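The idea of encoding a whole message as one fraction can be sketched with floating-point interval narrowing. This is a toy illustration under stated assumptions: the name arith_narrow, the struct interval type and the fixed probability table passed in by the caller are all hypothetical, and real arithmetic coders use integer arithmetic with renormalisation because double precision runs out after a few dozen symbols.

```c
#include <stddef.h>
#include <string.h>

/* Half-open interval [low, high) inside [0, 1). */
struct interval {
    double low;
    double high;
};

/* Narrow [0, 1) once per character of `msg`.  `symbols` lists the
 * alphabet and prob[i] is the probability of symbols[i].  Any number
 * inside the returned interval identifies the message, given its length.
 * An empty interval (low == high == 0) signals an unknown symbol. */
struct interval arith_narrow(const char *msg, const char *symbols,
                             const double *prob)
{
    struct interval r = { 0.0, 1.0 };

    for (; *msg != '\0'; msg++) {
        const char *hit = strchr(symbols, *msg);
        if (hit == NULL) {
            r.low = 0.0;
            r.high = 0.0;
            return r;
        }

        /* Cumulative probability of all symbols before this one. */
        size_t k = (size_t)(hit - symbols);
        double cum = 0.0;
        for (size_t i = 0; i < k; i++)
            cum += prob[i];

        /* Keep only the slice of the interval owned by this symbol. */
        double range = r.high - r.low;
        r.high = r.low + range * (cum + prob[k]);
        r.low = r.low + range * cum;
    }
    return r;
}
```

With the alphabet "ab" and probabilities 0.75 and 0.25, the message "ab" narrows the interval to [0.5625, 0.75). Likelier messages end in wider intervals, which need fewer bits to pin down.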

2.5 Limitations of the Existing Systems

All the present systems make use of data collected from the file to be compressed. These systems have been very successful. There are also competitive algorithms that combine the existing algorithms to provide high degrees of compression together with very fast and efficient decompression. There are also algorithms optimized for a single operation, such as compression only or decompression only.

2.6 Proposed System

Keeping the efficiency and performance of the existing systems in view, we are trying to implement a new algorithm based on similar concepts. This algorithm provides lossless compression of text files. It uses data collected from a prior analysis of text files to generate a static dictionary containing words that are used very frequently in almost all files. An analysis of the input file extracts the words that are not present in the static dictionary and creates a dynamic dictionary from them. This dynamic dictionary is essential for decompressing the compressed file. Each occurrence of a word in the input file is replaced with a code that represents the kind of word and the position at which it occurs in the respective dictionary. The output also contains other meta-information, such as length and capitalization, that is required for lossless decompression. Although it carries all this information, the compressed output file is considerably smaller, because the encoding that the application follows uses a minimal set of bits to represent each word. The bit patterns are optimized so that each pattern takes as little space as possible.
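The two-dictionary lookup described above might be sketched as follows. Everything named here is hypothetical: the five-word static_dict, the table sizes, and the names lookup_word, struct word_code and struct dynamic_dict are placeholders for illustration, not the actual JANG dictionaries or file formats. The sketch also assumes lowercase input, since capitalization is recorded separately, and it assigns dynamic positions in order of first appearance rather than by frequency.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DYNAMIC  256  /* illustrative capacity */
#define MAX_WORD_LEN 32   /* illustrative limit, including the NUL */

enum word_kind { WORD_STATIC, WORD_DYNAMIC };

/* The code that replaces a word: which dictionary, and where in it. */
struct word_code {
    enum word_kind kind;
    size_t position;
};

struct dynamic_dict {
    char words[MAX_DYNAMIC][MAX_WORD_LEN];
    size_t count;
};

/* Placeholder for the static dictionary of very common English words. */
static const char *const static_dict[] = { "the", "of", "and", "to", "in" };
static const size_t static_count = sizeof static_dict / sizeof static_dict[0];

/* Look `word` up in the static dictionary first, then in the dynamic
 * one, adding it to the dynamic dictionary if it is new.
 * Returns 0 on success, -1 if the word cannot be stored. */
int lookup_word(struct dynamic_dict *dyn, const char *word,
                struct word_code *code)
{
    for (size_t i = 0; i < static_count; i++) {
        if (strcmp(static_dict[i], word) == 0) {
            code->kind = WORD_STATIC;
            code->position = i;
            return 0;
        }
    }
    for (size_t i = 0; i < dyn->count; i++) {
        if (strcmp(dyn->words[i], word) == 0) {
            code->kind = WORD_DYNAMIC;
            code->position = i;
            return 0;
        }
    }
    if (dyn->count >= MAX_DYNAMIC || strlen(word) >= MAX_WORD_LEN)
        return -1;

    strcpy(dyn->words[dyn->count], word);
    code->kind = WORD_DYNAMIC;
    code->position = dyn->count++;
    return 0;
}
```

Looking up "and" yields a static code with position 2, while an unseen word such as "jang" is appended to the dynamic dictionary and receives the same dynamic position on every later occurrence. Because the decoder needs the same dynamic dictionary to map positions back to words, it has to travel with the compressed file.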

2.7 Degree of Application of the JANG Compressor v1.0

This version of the JANG Compressor uses words from the English language. Though it is applicable to any language, the compression ratio will not be as high as for a file written in ordinary English. This is because the static dictionary, which holds the most commonly used words, was populated mainly by analysing files in which English words form the majority. The static dictionary is the backbone of the scheme and is what makes the best compression possible.

This version also gives the lowest priority to symbols that may be present in the file. As a result, JANG Compressor v1.0 does not reach its usual benchmark figures on a symbol-intensive file.

The best-case compression is obtained when the files are large and have many repeating words.

The worst-case compression results from small, symbol-intensive files.

2.8 Proposed Future Updates to JANG Compressor

• Algorithm optimization for symbol-dense files.
• Scaling to most languages.
• Parallelization of the application to provide high performance.
• Web interface for the application to provide online compression.
• Automatic optimization of the dictionary to provide a higher degree of performance.

2.9 Proposed Implementation

In its first phase, JANG Compressor is implemented in the highly efficient C programming language. The features of C are as follows:

2.9.1 Theoretical features of C

C is an imperative (procedural) systems implementation language. It was designed to be compiled using a relatively straightforward compiler, to provide low-level access to memory, to provide language constructs that map efficiently to machine instructions, and to require minimal run-time support. C is therefore useful for many applications because of its very high efficiency, and we make use of that feature in developing our compressor. The C compiler we use is the GNU C compiler (GCC), which works on the open source platform.

2.9.2 Minimalism

C is designed to provide high-level abstractions for all the native features of a general-purpose CPU, while at the same time allowing modularization, structure, and code re-use. C is somewhat strongly typed (compilers emit warnings or errors on type mismatches) but allows programmers to override types in the interests of flexibility, simplicity or performance, while being natural and well-defined in its interpretation of type overrides.

2.9.3 Characteristics

C has facilities for structured programming and allows lexical variable scope and recursion, while a static type system prevents many unintended operations. In C, all executable code is contained within functions. Function parameters are always passed by value. Pass-by-reference is achieved in C by explicitly passing pointer values. C programs follow clear formatting and indentation conventions. One of its most important characteristics is that complex functionality such as I/O, string manipulation, and mathematical functions is consistently delegated to library routines.

2.9.4 GCC features

GCC is often the compiler of choice for developing software that is required to execute on a wide variety of hardware and on various platforms when efficiency is the key factor. System-specific compilers provided by hardware or OS vendors can differ substantially, complicating both the software's source code and the scripts which invoke the compiler to build it. We therefore chose GCC: most of the compiler is the same on every platform, so only code which explicitly uses platform-specific features must be rewritten for each system, which would not have been possible if other C compilers were used.

2.10 Feasibility Study

During system analysis, a feasibility study of the proposed system was carried out to see whether it was beneficial to the organization. The three key considerations involved in the feasibility study, and its results, are given below.

2.10.1 Technical feasibility

Technical feasibility centres on the existing environment and the extent to which it can support the proposed system. The technical resources that the organization presently has are sufficient to implement the new system. The new system relies only on a standard PC platform (see section 2.11), so it has a wide scope.

2.10.2 Economic feasibility

Economic feasibility, more commonly known as cost/benefit analysis, is the most frequently used method for evaluating the effectiveness of a candidate system. The procedure is to determine the benefits and savings that are expected from the candidate system and compare them with those of the existing system. If the benefits of the candidate system outweigh those of the existing one, the decision is made to design and implement the system.

2.10.3 Operational feasibility

People are inherently resistant to change, and computers have been known to facilitate change. An estimate should be made of how customers are likely to respond to the system. In any case, the system is expected to be advantageous to its users.

2.11 System Requirements

2.11.1 Hardware requirements

• Processor: Pentium III or higher
• RAM: 512 MB DDR2 or higher (preferable)

2.11.2 Platform requirements

• GNU/Linux (Debian 3 or higher)
• GNU Compiler Collection version 4.3.2 or higher


3. SYSTEM DESIGN

The purpose of the system design phase is to plan a solution to the problem specified by the requirements document. The design activity results in three separate outputs: architecture design, high-level design and detailed design. Architecture focuses on looking at a system as a combination of many different components and how they interact with each other to produce the desired results. The high-level design identifies the modules that should be built for developing the system and the specifications of these modules. At the end of system design all the major data structures, file formats, output formats, etc. are also fixed. In detailed design the internal logic of each of the modules is specified.

The approach used for designing the JANG application is the top-down approach. A good plan of attack for designing the algorithm is to break the task down into a few sub-tasks, decompose each of these sub-tasks into smaller sub-tasks, and so forth. Eventually the sub-tasks become so small that they are trivial to implement. One of the advantages of dividing a programming task into sub-tasks is the efficiency and ease of algorithm design. Also, different people can work on different sub-tasks.

3.1 Requirement analysis

Requirement analysis is done in order to understand the problem a software system is to solve. The emphasis in requirement analysis is on identifying what is needed from the system, not how the system will achieve its goals. ASCII encoding, which is the most commonly used format to represent textual content, uses a byte, i.e. 8 bits, to represent a character. This type of encoding involves a large amount of redundancy, as it has to account for 256 possible values for each character. Our objective was to create an original compressed file format that removes this redundancy to a great extent. A compression algorithm was necessary to convert a normal ASCII file to this special format, and a decompression algorithm to convert the compressed file back to ASCII. A provision was also required to compute the percentage of compression and to run the compression and decompression utilities using a basic command line interface.

3.2 Benchmarking

Benchmarking is the process of comparing cost, cycle time, productivity etc. of the current system with widely accepted standard systems. It provides a snapshot of performance of the system. Benchmarking is most used to measure performance using a specific indicator resulting in a metric of performance that is then compared to others. In this case, we have compared JANG’s performance with leading compression formats like ZIP, RAR and GZIP. It is to be noted that while these industrial standards have extended their problem domain to include multimedia content such as images, video and other binary files, JANG focuses exclusively on text files.

3.3 Architecture

Architectural design provides the blueprint for design with necessary specifications of hardware, software, people or other resources. Multiple architectures are evaluated before one is selected. An architecture is evaluated considering the difference between architectural design values and intentions. We were initially required to select an operating platform. The first coding environment we considered was the Borland C++ compiler in a system running Microsoft Windows. However, owing to the lack of robustness observed and the proprietary nature of the software used, we were forced to migrate to a more reliable, open platform. Considering the benefits of Open Source software and our familiarity with the organization of the GNU/Linux operating system, the next environment we selected was GCC 4.3.2 (The GNU Compiler Collection) running on Ubuntu 8.10 (Intrepid Ibex).

3.4 High-level design

Our next task was to identify the modules required to accomplish our twin goals of encoding and compression:

3.4.1 Dictionary Generation: This module generates a dynamic dictionary from the given text file. The dictionary lists each word in the file as well as its frequency, i.e. the number of times it appears in the file.

3.4.2 Static Dictionary Creation: This module uses the Dictionary Generation module to obtain a static dictionary that mirrors the frequency of words as normally found in the English language. This is done by providing several novels, documents and other publications as input to the previous module.

3.4.3 Encoding: The encoding phase consists of encoding the file using our original algorithm, taking the text stream by word rather than by character. The output of this phase is the encoded bit stream, itself written out as text.

3.4.4 Compression: This consists of compressing the encoded file to a fraction of its size, by removing the text encoding of the bit stream. The main output file has the extension '.JANG'.

3.4.5 Decompression: This phase simply encodes the bit stream as text once again in preparation for the decoding process.

3.4.6 Decoding: The decoding phase uses the JANG format specification to generate the original text file from the decompressed JANG file. The output text file must correspond exactly with the input.
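The Compression module's job of "removing the text encoding of the bit stream" amounts to packing every eight '0'/'1' characters into one byte. The following is a minimal sketch of that step, assuming most-significant-bit-first packing and zero padding of the final byte; the function name pack_bits and these conventions are assumptions of the example, not the JANG format specification.

```c
#include <stddef.h>

/* Pack a string of '0'/'1' characters into bytes, first character in
 * the most significant bit.  On success returns the number of bytes
 * written and stores in *last_bits how many bits of the final byte are
 * real data (0 for empty input), so that padding can be undone later.
 * Returns (size_t)-1 on an invalid character or if `out` is too small. */
size_t pack_bits(const char *bits, unsigned char *out, size_t out_size,
                 unsigned *last_bits)
{
    size_t nbytes = 0;
    unsigned filled = 0;
    unsigned char current = 0;

    *last_bits = 0;
    for (; *bits != '\0'; bits++) {
        if (*bits != '0' && *bits != '1')
            return (size_t)-1;

        /* Shift the new bit in at the low end of the current byte. */
        current = (unsigned char)((current << 1) | (*bits - '0'));
        filled++;

        if (filled == 8) {
            if (nbytes >= out_size)
                return (size_t)-1;
            out[nbytes++] = current;
            current = 0;
            filled = 0;
            *last_bits = 8;
        }
    }

    /* Left-align a final partial byte and pad it with zero bits. */
    if (filled > 0) {
        if (nbytes >= out_size)
            return (size_t)-1;
        out[nbytes++] = (unsigned char)(current << (8 - filled));
        *last_bits = filled;
    }
    return nbytes;
}
```

Packing the ten-character string "0100000101" gives the two bytes 0x41 and 0x40, with *last_bits set to 2. The Decompression module would perform the inverse, expanding each byte back into eight characters and discarding the padding.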

3.5 Dataflow Diagrams

[Dataflow diagram: the JANG shell script dispatches to the COMPRESS, DECOMPRESS and OTHERS shell scripts.]

[Dataflow diagram, COMPRESSFILE (shell script): input file -> 1-DIC_RELEASE -> intermediate file and dynamic dictionary -> 2-ENCODE -> ASCII bit pattern file -> 3-COMPRESS -> compressed file with dynamic dictionary and pad bit information.]

[Dataflow diagram, DECOMPRESSFILE (shell script): compressed file with dynamic dictionary and pad bit information -> 4-DECOMPRESS -> ASCII intermediate file -> 5-DECODE -> decompressed text file.]

3.6 Detailed Design

3.6.1 Dictionary Generation

The dictionary is the most important part of the algorithm, since it determines the degree of compression the algorithm provides. In this process, the input file is analysed to extract all the words from it. Any punctuation marks or special symbols act as delimiters for the words. Each word is matched against those in the static dictionary. If a match is found, processing of that word stops and the next word is examined. If no match is found, the word is entered into the junk file. Before it is entered, it is matched against the existing entries in the junk file. If a match is found, the counter associated with that word is incremented. Otherwise the word is added as the last entry with a count of 1. After each entry the file is sorted by the frequency of each word, so that the file stays sorted at every point in time. When all the words have been processed, the junk file is split into various dictionaries, categorized as the dynamic dictionaries. The dynamic dictionary is used in the later stages of compression. It contains all the words in the input file which are not present in the static dictionaries.

3.6.2 Encoding

Encoding, as stated earlier, is performed by a word-based traversal of the input text file. We consider the text file as an alternating series of words and sequences of special characters. Each word is extracted along with the character immediately preceding it. The word is then searched for in the static and dynamic word indices, in that order. The actual files to be searched are chosen based on the number of characters in the word. This excludes the ultra-common and ultra-frequent word indices, which contain words irrespective of length. It may be noted that every word (where 'word' is defined as a sequence of alphanumeric characters) must logically be present in one of the indices, due to the presence of a dynamically generated word index.
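Treating the text as an alternating series of words and special-character sequences can be sketched with a small tokenizer. The function name next_token is an assumption of this example; it uses the definition given above, in which a word is a run of alphanumeric characters, and treats every other run as a special sequence.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the token that starts at `s` and set *is_word to
 * 1 for a run of alphanumeric characters, or 0 for a run of anything
 * else.  Returns 0 at the end of the string. */
size_t next_token(const char *s, int *is_word)
{
    size_t n = 0;

    if (s[0] == '\0') {
        *is_word = 0;
        return 0;
    }

    /* The first character decides which kind of run this is. */
    *is_word = isalnum((unsigned char)s[0]) != 0;

    /* Extend the token while the characters are of the same kind. */
    while (s[n] != '\0' &&
           (isalnum((unsigned char)s[n]) != 0) == *is_word)
        n++;
    return n;
}
```

Applied repeatedly to "Hello, world", it yields the word "Hello" (5 characters), the special sequence ", " (2 characters) and the word "world" (5 characters). An encoder built on it would look each word up in the indices and handle each special sequence character by character.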
Once the word is found, coded representations of its frequency class (as either a static or a dynamic word), its capitalisation and the special symbol preceding it are saved to an output file. A special character sequence is traversed character by character. Each character, or the series beginning with that character, is treated as one of the following:



- A common sequence: This includes frequently occurring sequences such as the ellipsis […], or even common emoticons [:-)].
- A repeating character sequence: A sequence in which the same character is repeated several times. A long series of spaces, hyphens etc. would qualify as such a sequence.
- A special symbol: If the series beginning with the character does not fall under either of the above categories, it is identified as a single special symbol.

The corresponding coded representation is saved to the file. The coded representation consists entirely of zeroes and ones, and thus corresponds to the exact bits of the compressed JANG file obtained at the end of the compression stage.

3.6.3 Compression

The file produced by the previous stage contains the actual bits of the encoded output written out as text, so that the format may be examined at the bit level using a text editor. Compression merely involves taking each sequence of 8 characters (the so-called ‘bits’) and saving it as a single character in a new file with the extension ‘.JANG’. The last sequence encountered may have fewer than 8 characters; in that case, the number of characters is saved to another file. This phase marks the end of the compression utility, the first half of the system.

3.6.4 Decompression

This phase marks the start of the decompression utility, the second half of the system. It reverses the previous phase and converts the JANG file back to the ASCII bit stream that was obtained as the output of the encoding phase.

3.6.5 Decoding

This phase decodes the JANG format to recover the originally supplied input text file. It is performed using the format specification given in Section 3.7.
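The packing step of Section 3.6.3 can be sketched as follows. The function name `pack_bits`, the most-significant-bit-first order and the zero padding of a short final group are assumptions made for this illustration; the report only states that each group of 8 characters becomes one character and that the size of a short final group is recorded separately.

```c
#include <stddef.h>

/* Packs a string of '0'/'1' characters into bytes, most significant bit
   first (assumed order). A final group shorter than 8 characters is padded
   with zero bits on the right (assumed padding).
   out must have room for (strlen(bits) + 7) / 8 bytes.
   *last_bits receives the size of the final group (1..8), or 0 if the
   input is empty. Returns the number of bytes written. */
size_t pack_bits(const char *bits, unsigned char *out, int *last_bits)
{
    size_t nbytes = 0;
    int filled = 0;          /* bits placed in the current byte so far */
    unsigned char cur = 0;

    *last_bits = 0;
    for (; *bits != '\0'; bits++) {
        cur = (unsigned char)((cur << 1) | (*bits == '1'));
        if (++filled == 8) {
            out[nbytes++] = cur;
            *last_bits = 8;
            cur = 0;
            filled = 0;
        }
    }
    if (filled > 0) {
        /* Short final group: shift it into the high bits of the byte. */
        out[nbytes++] = (unsigned char)(cur << (8 - filled));
        *last_bits = filled;
    }
    return nbytes;
}
```

For the input "0100000101" this produces the two bytes 0x41 and 0x40 and reports a final group of 2 bits. The decompression phase needs exactly that count in order to discard the padding.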


3.7 Bit Pattern in the compressed file

The bit pattern of an entry in the compressed file depends mainly on the category of the index file to which the entry belongs. The bit pattern is laid out as follows.

Bits 0 and 1:
    00 : Common word
    01 : Dictionary word
    10 : Special symbol
    11 : Blob category word

For the pattern 00 or 01, bits 2 and 3:
    00 : Ultra common / Ultra frequent
    01 : Very common / Very frequent
    10 : Common / Frequent
    11 : Hardly ever / Not frequent

For the pattern 00 00:
    Bits 4 to 11 : The code of the word in the ultra frequent index file.

For the pattern 00 01:
    Bits 4 to 7 : The length of the word.
    Bits 8 and 9 : The capitalization information:
        00 : small case
        01 : capital case
        10 : sentence case
        11 : mixed case
    Note: For mixed case, the next ‘x’ bits, where ‘x’ is the length of the word, are reserved for the representation of the mixed caps, with 0 representing a small letter and 1 representing a capital letter.
    Bits 10 to 13 : The character preceding the word.
    Note: With 4 bits we can represent a set of 16 commonly used symbols. If the symbol is none of these, the next 7 bits represent its index in the file that holds all the special symbols.
    Bits 14 to 24 : The index of the word in the file that it belongs to.

The same layouts apply to the dictionary words (ultra frequent and very frequent). For all other words, the length is represented using 5 bits and the code of the word using a 12-bit pattern.
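The capitalization field and the mixed-case mask described above can be sketched as follows. The function `classify_caps` and the `CAP_*` constants are hypothetical names; the tie-breaking rule for a single capital letter (treated here as sentence case) and the handling of digits are assumptions, since the report does not define them.

```c
#include <ctype.h>
#include <stddef.h>

/* Two-bit capitalization codes, numbered as in the table of Section 3.7. */
enum { CAP_SMALL = 0, CAP_CAPITAL = 1, CAP_SENTENCE = 2, CAP_MIXED = 3 };

/* Classifies the capitalization of word. For mixed case, mask receives one
   '0'/'1' character per character of the word (1 = capital letter);
   otherwise mask is set to the empty string.
   mask must have room for strlen(word) + 1 characters. */
int classify_caps(const char *word, char *mask)
{
    size_t n = 0, letters = 0, upper = 0, i;

    for (i = 0; word[i] != '\0'; i++) {
        if (isalpha((unsigned char)word[i]))
            letters++;
        if (isupper((unsigned char)word[i]))
            upper++;
        n++;
    }
    mask[0] = '\0';

    if (upper == 0)
        return CAP_SMALL;
    if (upper == 1 && isupper((unsigned char)word[0]))
        return CAP_SENTENCE;   /* assumption: "A" counts as sentence case */
    if (upper == letters)
        return CAP_CAPITAL;

    /* Mixed case: one mask bit per character of the word. */
    for (i = 0; i < n; i++)
        mask[i] = isupper((unsigned char)word[i]) ? '1' : '0';
    mask[n] = '\0';
    return CAP_MIXED;
}
```

For "iPhone" the function returns the mixed-case code and the mask "010000", which is the run of ‘x’ extra bits that the note above reserves after the capitalization field.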

For the pattern 10, bits 2 and 3:
    00 : Special symbol
    01 : Common sequence
    10 : Repeating character sequence
    11 : Not used

For the patterns 10 00 and 10 10:
    Bits 4 to 11 : ASCII code for the special sequence
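As a worked example of the table above, the code for a single special symbol (pattern 10 00 followed by 8 bits) can be built as text of zeroes and ones. Both functions below are illustrative stand-ins: the project's own `toBinaryText` helper is not shown in this section, and the encoder in Appendix B may transform the character value before writing it.

```c
#include <stddef.h>

/* Writes the lowest `width` bits of value into out as '0'/'1' text,
   most significant bit first. out must have room for width + 1 characters. */
void to_binary_text(unsigned int value, int width, char *out)
{
    int i;

    for (i = 0; i < width; i++)
        out[i] = ((value >> (width - 1 - i)) & 1u) ? '1' : '0';
    out[width] = '\0';
}

/* Builds the 12-character code for a single special symbol, following the
   table literally: "10" (special symbol category), "00" (single symbol),
   then the 8-bit character code. */
void encode_special_symbol(char symbol, char out[13])
{
    out[0] = '1';
    out[1] = '0';
    out[2] = '0';
    out[3] = '0';
    to_binary_text((unsigned char)symbol, 8, out + 4);
}
```

For the full stop (ASCII 46, binary 00101110) this gives the 12-character string "100000101110".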

3.8 Testing

Testing is one of the most crucial phases of system development. It is the phase in which errors remaining from all previous phases must be detected, and it is a major quality measure employed after coding.

3.8.1 Testing Fundamentals

3.8.1.1 Error, Fault and Failure

Error refers to the discrepancy between a computed, observed or measured value and the true, specified or theoretically correct value. The term is also used for a human action that results in software containing a defect or fault. A fault is a condition that causes a system to fail in performing its required function. A failure is the inability of a system or component to perform a required function according to a given specification.


3.8.1.2 Test Oracles

A test oracle is the mechanism used to check the correctness of a program's output for the test cases. The output of the test oracle and that of the program under test are compared to determine the behaviour of the program for those test cases.

Testing principles and objectives

- All tests should be traceable to the software requirements.
- Tests should be planned well before testing begins.
- The goal of the testing activity should be to maximize the number of errors detected while minimizing the number of test cases (and hence the cost).
- The primary objective in selecting test cases is to ensure that if there is an error or fault, it is exercised by one of the test cases.
- Testing should be reliable and valid.
- A good test case should have a high probability of finding an as yet undiscovered error.
- Exhaustive testing is not possible.
- The two aspects of test case selection are specifying the criteria for evaluating a set of test cases, and generating a set of test cases that satisfies the given criteria.

Several types of testing are involved in a typical testing procedure; they can mostly be summarized as the following three types.

3.8.1.3 Unit Testing

This is the starting point of testing. A module is tested separately, often by the coder himself while the module is being coded. The purpose is to detect coding errors in different parts of the module; the goal is to test the internal logic of the module. Structural testing is best suited to this level.

3.8.1.4 Integration Testing

After unit testing, the modules are gradually integrated into subsystems, which are then integrated to eventually form the entire system. During the integration of modules, integration testing is performed to detect design errors by focusing on the interconnections between the modules. The goal here is to test the integrity of the system.

3.8.1.5 System Testing

After the system is put together, system testing is performed. The complete system is checked against the system requirements to see whether they are met and whether it conforms to general levels of acceptance.

3.8.2 Testing the Application

The application was tested in every phase of its development. The details are as follows.

3.8.2.1 Unit Testing

Each module described above was tested after its development. The input and output files of each module were identified, and a sample input file was created manually. This file was then used as the input to the module, and the output was also verified manually. The typical errors encountered were file pointer errors and array bounds overruns. These were rectified by using a log file to locate the exact position of the error and by printing appropriate messages at regular intervals.

3.8.2.2 Integration Testing

During development the modules were kept strictly separate. Each module produces a temporary file that is used as the input to the next module. As a result, few integration problems were encountered during the course of development.

3.8.2.3 System Testing

After successful integration testing, the whole application was deployed on a system with few errors.

3.8.2.4 Tools Used in Testing

- Normal error messages reported by the GCC compiler
- The strace command in GNU/Linux
- File comparison applications such as ‘Beyond Compare’, to compare the input and output files
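The round-trip check performed with a file comparison tool (the original input against the decompressed output) can also be automated. The sketch below is not part of the project; `files_identical` is a hypothetical helper that compares two files byte by byte.

```c
#include <stdio.h>

/* Compares two files byte by byte.
   Returns 1 if they are identical, 0 if they differ,
   and -1 if either file cannot be opened. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    int ca, cb, result = 1;

    if (a == NULL || b == NULL) {
        if (a != NULL)
            fclose(a);
        if (b != NULL)
            fclose(b);
        return -1;
    }
    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb) {      /* also catches one file ending early */
            result = 0;
            break;
        }
    } while (ca != EOF);

    fclose(a);
    fclose(b);
    return result;
}
```

A system test could compress a sample file, decompress the result and then require that `files_identical` returns 1 for the original and the decompressed file.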

4. IMPLEMENTATION

4.1 Application Files

The application is designed to be run from the command line of a shell in GNU/Linux. To this end, the installation procedure copies the required files to the appropriate locations in the file system. The functions of the files that make up the application are as follows.

4.1.1 Shell scripts

A shell script is a script written for the shell, or command line interpreter, of an operating system. It is used to perform operations such as file manipulation, program execution and printing text.

Jang.sh: Shell script that receives arguments based on the user's choice of operation (compress or decompress) and transfers execution to the respective shell script.
Compressfile.sh: Shell script that calls the executables required for the file compression procedure.
Decompressfile.sh: Shell script that calls the executables used to perform the decompression.

4.1.2 Executable files

1-dic_release: Performs the dynamic dictionary generation.
2-encode: Generates an ASCII bit stream based on the static and dynamic dictionaries.
3-compress: Converts the ASCII encoded Hex pattern to a bit stream.
4-decompress: Converts the Hex coded file to an ASCII encoded bit stream. The output is an ASCII encoded file.
5-decode: Generates the decompressed original version of the text file from the static and dynamic dictionaries along with the ASCII encoded file.

4.1.3 Other files

Help.txt: This file contains the help information.
Version.txt: This file contains the version information.
Credits.txt: This file contains the credit information.

4.2 INSTALLATION

- Copy ‘compress.sh’, ‘decompress.sh’ and ‘jang.sh’ to /usr/bin/.
- Copy the folder jangexec, containing the executable files, to /usr/bin/.
- Copy the folder static, containing the static dictionaries, to /usr/bin/.
- Copy the folder jangfiles, containing the documentation files, to /usr/bin/.

4.3 EXECUTION

The application can be invoked using the command ‘jang.sh’. The various options available are:

Auxiliary information: jang.sh --[help|version|credits]
Compression: jang.sh -c <inputfilename>
Decompression: jang.sh -d <compressed folder with the extension .dout> <output file name>

5. CONCLUSION

The JANG Compressor, though still in its infancy, has been found to provide a compression ratio of around 30% quite consistently. Given the fact that most programming languages use ASCII encoding, as do web pages and other documents, we believe that JANG has a promising future.

In text-heavy web sites, for instance, the JANG format may be used to reduce the amount of content to be transferred, provided that the browser has a JANG decompressor, either as a built-in feature or as a plug-in or add-on. This could prove to be a boon for slow and low-bandwidth connections. Source code compression is another place where JANG may be used in its present form. Our compressor may be a solution for the massive amount of code that developers working at different sites need to send one another. Web sites such as Project Gutenberg (a storehouse of fiction and non-fiction books in the public domain) would definitely benefit from compressing their content. JANG also affords a measure of security, as the decompressor relies on the dynamic dictionaries generated to decode the compressed .JANG file. The format would thus be ideal for transferring sensitive information.

Our team’s experience in working on this project has been profound. We have advanced through the various stages of the life-cycle of a software system: from the initial plan, the proverbial ‘light-bulb’, through the painstakingly careful level of design and the more active stages of coding and testing to the final, rewarding phase of implementation. We have dealt with the physical, intellectual and external challenges that this undertaking has brought us, and emerged successful. We present our mini-project, JANG version 1.0, in the hope that it shall move on to ever greater heights.


7. APPENDIX

7.1 APPENDIX A : Program Flow Chart

Start

1. Extract the key words, without any spaces or punctuation.
2. Create a junk file with the key words separated by ’#’.
3. Extract each word from the junk file and obtain its frequency.
4. Sort the file according to frequency.
5. Split the sorted file into the different classes of dictionary. Dictionary creation is complete.
6. Extract the key words from the input file.
7. Is the item a word?
   Yes: make the representation of the word.
   No: find the category to which it belongs, find the sequence code number in the symbol file, get the sequence code and make the actual representation.
8. Write the word to the file, each bit in binary format in ASCII.
9. Group the ASCII into 4 and find the actual representation as a single ASCII code.
10. Produce the final compressed file.

Stop

Fig 7.1 Data and Control Flow in the Compression Phase
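The dictionary-building steps at the top of the chart (obtain the frequency of each word, then sort by frequency) can be sketched in memory as follows. This is an illustration under stated assumptions, not the project's method: the program in Appendix B keeps the counts in a junk file and re-sorts blocks of it after every update, whereas this sketch uses a fixed-size array and a single `qsort` call at the end. The names `Entry`, `count_word` and `sort_by_frequency` and both size limits are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 1024   /* illustrative limit, not taken from the project */
#define MAX_WORD 64        /* illustrative limit, not taken from the project */

typedef struct {
    char word[MAX_WORD];
    long count;
} Entry;

/* Records one occurrence of word in table, which currently holds n entries.
   Returns the new number of entries. Words longer than MAX_WORD - 1
   characters are truncated; new words are ignored once the table is full. */
size_t count_word(Entry *table, size_t n, const char *word)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(table[i].word, word) == 0) {
            table[i].count++;
            return n;
        }
    }
    if (n < MAX_ENTRIES) {
        strncpy(table[n].word, word, MAX_WORD - 1);
        table[n].word[MAX_WORD - 1] = '\0';
        table[n].count = 1;
        n++;
    }
    return n;
}

/* Orders entries by descending count, then alphabetically. */
static int by_count_desc(const void *a, const void *b)
{
    const Entry *ea = a;
    const Entry *eb = b;

    if (ea->count != eb->count)
        return (eb->count > ea->count) - (eb->count < ea->count);
    return strcmp(ea->word, eb->word);
}

void sort_by_frequency(Entry *table, size_t n)
{
    qsort(table, n, sizeof table[0], by_count_desc);
}
```

After sorting, the most frequent words sit at the front of the table, which is the property the splitting step relies on when it assigns the first words to the ultra frequent dictionary.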

Fig 7.2 Data and Control Flow in the Decompression Phase

7.2 APPENDIX B : Source Codes

7.2.1 DICTIONARY GENERATION
/* This program uses the keywords separated by '#' from a file named
   'outputfile.txt' and forms the dictionary. The output file is named as
   per the dictionary names */
#include<stdio.h>
#include<string.h>
#include"stringfuncs.h"
#include<stdlib.h>
#define REF_SIZE 100
#define UF_SIZE 256
#define VF_SIZE 1024
#define F_SIZE 4096
#define NF_SIZE 16384
#define MAX_WORD_SIZE 500
int reorder(long int);
int compare(const void *a, const void *b);
int extract_keywords();
int chk_static(char temp[]);
char *toString(int);
void create_dictionary();
int max_len=0;
long int uf_cnt=0;
float indicator=1.0,global_count=1.0,perc_compl=1.0;
char t_c;
char outString[100],inp_file[100];
int main(int argc,char *argv[])
{
  FILE *ip,*op,*cnt_fil;
  long int i,j,k,l,m,found=0,rewind,ref_count=0,t_ref_cnt=0,ref_no=0,ref_ptr[1000],frequency=1,prev_frequency=1;
  /* rewind is used to retrace the pointer back to a position
     i,j,k are used as variables in the loops
     l,m are used as the index variables for arrays
     found is used to signify if the keyword is found in the static dictionary

     ref_count is used as the index for the ref_ptr
     ref_no is an incremental variable used to check if the no of words used is less than REF_SIZE
     ref_ptr is used to get the pointer for each block for reordering
     prev_frequency is used to store the frequency of the word before the word under consideration and it is used to decide if the block has to be reordered or not
     frequency keeps track of the changed frequency of the word
  */
  int close_flag=0;
  /* close_flag is a boolean variable used to */
  char c,d,temp[100],key[100];
  /* c,d are the variables that read the characters from the files;
     temp contains the string that is taken from the input file;
     key is used to store the word extracted from the junk_dic.txt */
  if(argc==1)
  {
    puts("File Expected : Not Found");
    exit (1);
  }
  else
  {
    printf("Reading File '%s'\n",argv[1]);
  }
  extract_keywords(argv[1]); /* this function extracts the words without any other symbols */
  //puts("cREATING tHE jUNK dICTIONARY");
  ip=fopen("temp/outputfile.txt","r"); /* outputfile.txt has the keywords in the input file separated by # */
  op=fopen("temp/junk_dic.txt","w");
  fclose(op); /*to clear the junk dictionary*/
  cnt_fil=fopen("temp/glob_cnt.txt","r");
  /* this file contains the total number of words in the inputfile

  */
  fscanf(cnt_fil,"%f",&global_count);
  fclose(cnt_fil);
  //printf("count %f",global_count);
  c=fgetc(ip);
  if(c!='#' || !ip)
  {
    printf("Sorry! Invalid input file");
  }
  else
  {
    c=fgetc(ip);
    //printf("Reading Files\nCompleted :\n");
    while(c!=EOF)
    {
      perc_compl=(indicator/(float)global_count);
      //find the percentage completed. it is calculated as the present word count divided by total number of words
      if(perc_compl!=1.00)
        printf("\b\b\b\b\b\b%.2f",perc_compl*100);
      indicator=indicator+1.0; //Increment the word count
      l=0;
      if(c=='#')
        c=fgetc(ip); /*increment the pointer if '#' is observed*/
      while(c!='#' && c!=EOF)
      {
        temp[l++]=c;
        c=fgetc(ip);
      }
      temp[l]='\0'; /*ending character for the array temp.*/
      /*--segment 1 start--check whether the word is present in the static dictionary--*/
      if(!chk_static(temp)) /*if the word is not present in the static dictionary*/
      {
        ref_count=0;

        op=fopen("temp/junk_dic.txt","r+");
        d=fgetc(op);
        ref_ptr[ref_count]=ftell(op); /*set the pointer to the block of words*/
        ref_count++;
        if(!op)
          printf("Error opening the file for writing");
        else if(d==EOF) /*this is the first word in the junk_dictionary*/
        {
          frequency=1;
          fprintf(op,"#%09ld:%s#",frequency,temp);
          fflush(op);
          fclose(op);
        }
        else
        {
          /* this is not the first word in the dictionary. Here each word in
             the dictionary is checked to match with the word extracted and if
             a match is found in the dictionary then the frequency of the word
             is changed. If the modified frequency is greater than the
             frequency of the word preceding it then the whole of the
             dictionary from the present block, represented by the ref_ptr
             array, is reordered to keep the junk_dic always sorted at the end
             of each change. function 'reorder' is used for reordering */
          ref_no=0;
          while(d!=EOF)
          {
            m=0;
            found=0;
            if(d=='#')
              d=fgetc(op);
            prev_frequency=frequency; /*set the previous frequency */
            fscanf(op,"%ld",&frequency); /*get the new frequency*/
            d=fgetc(op);
            if(d==':')
              d=fgetc(op);
            while(d!='#' && d!=EOF)
            {
              key[m++]=d;
              d=fgetc(op);

            }
            key[m]='\0';
            ref_no++;
            if(ref_no==REF_SIZE-1) /* if the no of blocks allocated is complete then set the pointer to the present word to indicate the starting of a new block*/
            {
              ref_no=0;
              ref_ptr[ref_count]=ftell(op);
              ref_count++;
            }
            if(h_strcmp(key,temp)==0 && key[0]!='\0')
            {
              /* if the word from the input file is present in the junk_dic */
              found=1;
              frequency++;
              rewind=h_strlen(key)+11;
              fseek(op,-rewind,SEEK_CUR);
              fprintf(op,"%09ld",frequency);
              fflush(op);
              d=fgetc(op);
              close_flag=0;
              if(prev_frequency<frequency)
              {
                /*the frequency of the present word is greater than the previous one and hence reordering is required*/
                while(ref_count>0)
                {
                  /*reorder from the present block to the first one*/
                  t_ref_cnt=--ref_count;
                  reorder(ref_ptr[t_ref_cnt]);
                }
                ref_count=0;

              }
              close_flag=1;
              break;
            }
          }
          if(d==EOF && found==0 && temp[0]!='\0')
          {
            {
              /* if the word is not present in the junk_dic and hence we have to add it to the junk_dic */
              frequency=1;
              fprintf(op,"%09ld:%s#",frequency,temp);
              fflush(op);
            }
          }
          fclose(op);
        }
      }
      /*----segment 1 ends--checking and addition or modification to junk_dic.txt is complete--*/
    }
  }
  create_dictionary(); /*create the dictionary with the junk_dic created*/
  //puts("....Phase 1 Complete....");
  /*----segment 0 start--this segment simply checks the junk_dic and the uf.txt and if they are empty then they are set with '##'--*/
  op=fopen("temp/junk_dic.txt","r+");
  c=fgetc(op);
  if(c==EOF)
  {
    fprintf(op,"##");
  }
  fclose(op);
  op=fopen("dynamic/uf.txt","r+");
  c=fgetc(op);
  if(c==EOF)
  {
    fprintf(op,"##");
  }

  fclose(op);
  /*----segment 0 end--segment ends here--*/
  return 0;
}
/*----segment 2 start--this function reorders the block of words and then rewrites it to the file--*/
int reorder(long int ptr)
{
  FILE *op;
  char d,block_arr[REF_SIZE][10000];
  long int ref_no=0;
  int i=0;
  op=fopen("temp/junk_dic.txt","r+");
  if(op==NULL)
  {
    puts("File Open Error");
    exit(0);
  }
  if(fseek(op,ptr,SEEK_SET)) /*set the file pointer to the ptr*/
  {
    puts("Error Seeking the position");
    exit(0);
  }
  d=fgetc(op);
  while(d!=EOF && ref_no!=REF_SIZE)
  {
    /*get the complete block of words to the block_arr*/
    while(d!='#' && d!=EOF)
    {
      block_arr[ref_no][i++]=d;
      d=fgetc(op);
    }
    block_arr[ref_no][i++]='\0';
    i=0;
    ref_no++;
    d=fgetc(op);
  }

  qsort ((void *)block_arr,ref_no,sizeof(block_arr[0]),compare);
  /*this is the default sorting function which is implemented using the quick sort algorithm*/
  if(fseek(op,ptr,SEEK_SET))
  {
    puts("Error Seeking the position");
    exit(0);
  }
  /*this block writes the ordered block back into the junk_dic*/
  for(i=(ref_no-1);i>=0;i--)
  {
    fprintf(op,"%s#",block_arr[i]);
    fflush(op);
    //fgetc(op);
  }
  fclose(op);
  return 0;
}
/*----segment 2 ends--the junk_dic is modified--*/
/*----segment 3 starts--this is used to compare the strings in the qsort function--*/
int compare(const void *a, const void *b)
{
  int ret_val;
  ret_val=strcmp((char *)a,(char *)b) ;
  return(ret_val);
}
/*----segment 3 ends--*/
/*----segment 4 starts--this is used to generate the different dictionaries based on the junk_dic--*/
void create_dictionary()
{
  FILE *ip,*uf,*fp;
  long int vf_index[MAX_WORD_SIZE],f_index[MAX_WORD_SIZE],nf_index[MAX_WORD_SIZE];

  /*these arrays store the count of words entered in each file*/
  int i,j,k,copy,len,q;
  /* i,j,k are the variables used for the loops
     copy is a binary variable which is set when ':' is observed to indicate that the characters that follow are a keyword in the file
     len has the length of the keyword */
  char e,key_word [MAX_WORD_SIZE],ch_len[5],file_name[100];
  /* e is a temporary character used to read the character from the input file */
  for(i=0;i<MAX_WORD_SIZE;i++) /*set every count index to 0*/
  {
    vf_index[i]=0;
    f_index[i]=0;
    nf_index[i]=0;
  }
  ip=fopen("temp/junk_dic.txt","r");
  uf=fopen("dynamic/uf.txt","w");
  e=fgetc(ip);
  if((e!='#' && e!=EOF))
  {
    puts("Sorry Invalid File Format");
    exit(0);
  }
  while(e!=EOF)
  {
    i=0;
    copy=0;
    e=fgetc(ip);
    while(e!='#' && e!=EOF)
    {
      /* get the keyword after removing the count. Till the ':' is observed just keep reading all the characters */
      if(e!=':'&& copy==1)

      {
        key_word[i++]=e;
        e=fgetc(ip);
      }
      else
      {
        if(e==':')
          copy=1;
        e=fgetc(ip);
      }
    }
    key_word[i++]='\0';
    copy=0;
    len=h_strlen(key_word);
    if(len>max_len)
    {
      max_len=len;
    }
    if(uf_cnt<UF_SIZE)
    {
      fprintf(uf,"#%s",key_word);
      uf_cnt++;
      fflush(uf);
    }
    else
    {
      for(q=0;q<h_strlen(key_word);q++)
      {
        if(!(key_word[q]>='0'&& key_word[q]<='9') && !(key_word[q]>='a'&& key_word[q]<='z'))
          key_word[q]=key_word[q]+32;
      }
      if(vf_index[len]<VF_SIZE && len<=15)
      {
        h_strcpy(file_name,"dynamic/vf");
        vf_index[len]++;
      }
      else if(f_index[len]<F_SIZE && len<=31)
      {

        h_strcpy(file_name,"dynamic/f");
        f_index[len]++;
      }
      else if(nf_index[len]<NF_SIZE && len<=63)
      {
        h_strcpy(file_name,"dynamic/nf");
        nf_index[len]++;
      }
      h_strcpy(ch_len,toString(len));
      h_strcat(file_name,ch_len);
      h_strcat(file_name,".txt");
      fp=fopen(file_name,"a");
      if(fp==NULL)
      {
        puts("Cannot open the fp file pointer to write the length specific dictionary");
      }
      fprintf(fp,"#%s",key_word);
      fflush(fp);
      fclose(fp);
    }
  }
  fclose(uf);
  /*----segment 5--this segment creates all the files till the max_len for linear search. This will be eliminated in the future versions*/
  for(i=1;i<=max_len;i++)
  {
    h_strcpy(file_name,"dynamic/vf");
    h_strcpy(ch_len,toString(i));
    h_strcat(file_name,ch_len);
    h_strcat(file_name,".txt");
    fp=fopen(file_name,"r");
    if(fp==NULL)
    {
      fp=fopen(file_name,"w");
      fprintf(fp,"##");
      fflush(fp);
    }
    else

    {
      fclose(fp);
      fp=fopen(file_name,"a");
      fprintf(fp,"#");
    }
    fclose(fp);
  }
  for(i=1;i<=max_len;i++)
  {
    h_strcpy(file_name,"dynamic/f");
    h_strcpy(ch_len,toString(i));
    h_strcat(file_name,ch_len);
    h_strcat(file_name,".txt");
    fp=fopen(file_name,"r");
    if(fp==NULL)
    {
      fp=fopen(file_name,"w");
      fprintf(fp,"##");
      fflush(fp);
    }
    else
    {
      fclose(fp);
      fp=fopen(file_name,"a");
      fprintf(fp,"#");
    }
    fclose(fp);
  }
  for(i=1;i<=max_len;i++)
  {
    h_strcpy(file_name,"dynamic/nf");
    h_strcpy(ch_len,toString(i));
    h_strcat(file_name,ch_len);
    h_strcat(file_name,".txt");
    fp=fopen(file_name,"r");
    if(fp==NULL)
    {
      fp=fopen(file_name,"w");
      fprintf(fp,"##");
      fflush(fp);
    }

    else
    {
      fclose(fp);
      fp=fopen(file_name,"a");
      fprintf(fp,"#");
    }
    fclose(fp);
  }
  /*----segment 5 ends--*/
  printf("\b\b\b\b\b\b\b\b100.00");
  //puts("\nDictionary Generation Complete");
}
char *toString(int num)
{
  outString[0]='\0';
  int rev, dig;
  int i;
  rev = 5;
  while(num > 0)
  {
    dig = num % 10;
    rev = (rev * 10) + dig;
    num = num/10;
  }
  i=0;
  while(rev > 5)
  {
    dig = rev % 10;
    dig = dig + 48;
    outString[i++] = dig;
    rev = rev/10;
  }
  outString[i] = '\0';
  return outString;
}
int chk_static(char temp[])
{
  FILE *chk_f;
  char file_name[100],c,chk_temp[1000],ch_len[100];

  int i,k,chk_found=0;
  long int j;
  chk_f=fopen("static/uc.txt","r");
  fflush(chk_f);
  if(!chk_f)
    printf("Error Opening the uc file");
  c=fgetc(chk_f);
  while(c!=EOF && chk_found==0)
  {
    c=fgetc(chk_f);
    j=0;
    while(c!=EOF && c!='#')
    {
      chk_temp[j++]=c;
      c=fgetc(chk_f);
    }
    chk_temp[j++]='\0';
    if(h_strcmp(temp,chk_temp)==0)
    {
      fclose(chk_f);
      return 1;
    }
  }
  fclose(chk_f);
  i=h_strlen(temp);
  h_strcpy(file_name,"static/c");
  h_strcpy(ch_len,toString(i));
  h_strcat(file_name,ch_len);
  h_strcat(file_name,".txt");
  chk_f=fopen(file_name,"r");
  if(!chk_f)
    printf("Error Opening the %s file",file_name);
  c=fgetc(chk_f);
  while(c!=EOF && chk_found==0)
  {
    c=fgetc(chk_f);
    j=0;
    while(c!=EOF && c!='#')
    {
      chk_temp[j++]=c;
      c=fgetc(chk_f);

    }
    chk_temp[j++]='\0';
    if(h_stricmp(temp,chk_temp)==0)
    {
      fclose(chk_f);
      return 1;
    }
  }
  fclose(chk_f);
  h_strcpy(file_name,"static/vc");
  h_strcpy(ch_len,toString(i));
  h_strcat(file_name,ch_len);
  h_strcat(file_name,".txt");
  chk_f=fopen(file_name,"r");
  if(!chk_f)
    printf("Error Opening the %s file",file_name);
  c=fgetc(chk_f);
  while(c!=EOF && chk_found==0)
  {
    c=fgetc(chk_f);
    j=0;
    while(c!=EOF && c!='#')
    {
      chk_temp[j++]=c;
      c=fgetc(chk_f);
    }
    chk_temp[j++]='\0';
    if(h_stricmp(temp,chk_temp)==0)
    {
      fclose(chk_f);
      return 1;
    }
  }
  fclose(chk_f);
  h_strcpy(file_name,"static/he");
  h_strcpy(ch_len,toString(i));
  h_strcat(file_name,ch_len);
  h_strcat(file_name,".txt");
  chk_f=fopen(file_name,"r");
  if(!chk_f)
    printf("Error Opening the %s file",file_name);

  c=fgetc(chk_f);
  while(c!=EOF && chk_found==0)
  {
    c=fgetc(chk_f);
    j=0;
    while(c!=EOF && c!='#')
    {
      chk_temp[j++]=c;
      c=fgetc(chk_f);
    }
    chk_temp[j++]='\0';
    if(h_stricmp(temp,chk_temp)==0)
    {
      fclose(chk_f);
      return 1;
    }
  }
  fclose(chk_f);
  return 0;
}
int extract_keywords( char inp_file[])
{
  FILE *ifp,*ofp;
  /* Initialization
     ifp : Input file pointer
     ofp : Output file pointer
     key_word : stores the intermediate word before adding to the output buffer */
  char c;
  char key_word[100];
  long int count=0,global_count=0;
  ifp=fopen(inp_file,"r");
  ofp=fopen("temp/outputfile.txt","w");
  while(!feof(ifp))
  {
    c=fgetc(ifp);
    /* neglect all the characters till an alphabet or a number is reached*/
    if(!((c>='a' && c<='z')||( c>='A' && c<='Z') || (c>='0' && c<='9'))&& c!=EOF)
    {

      while(!(((c>='a' && c<='z')||( c>='A' && c<='Z') || (c>='0' && c<='9')))&& c!=EOF)
      {
        c=fgetc(ifp);
      }
    }
    count=0; /* reset the index for the key_word*/
    /* get the key word */
    global_count++;
    while(((c>='a' && c<='z')||( c>='A' && c<='Z') || (c>='0' && c<='9')) && c!=EOF && c!=' ')
    {
      key_word[count]=c;
      count++;
      c=fgetc(ifp);
    }
    key_word[count++]='\0';
    if(count>1)
    {
      fprintf(ofp,"#%s",key_word); /* printing the keyword to the output file*/
      fflush(ofp);
    }
  }
  fclose(ifp);
  fclose(ofp);
  ifp=fopen("temp/glob_cnt.txt","w");
  fprintf(ifp,"%ld",global_count);
  fclose(ifp);
  return 1;
}

7.2.2 ENCODE
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include"stringfuncs.h"
#define SEQ_LEN_CHARS 64
#define MAX_WORD_LEN 64
int isAlNum(char);
int isBlob(char);

int isSmall(char);
int isBig(char);
char *toBinaryText(int, int);
char *toString(int);
long int seekWord(char *, char *, int);
int cool = 0;
FILE *ipStream, *opStream, *percStream;
char binText[100];
char outString[100];
int main(int argc, char *argv[])
{
  fpos_t rightPos;
  int charCount, lettCount, noOfSpecSeqs;
  int specNo;
  char specTemp[100];
  char fileName[100], lenString[100];
  char fileStart[100][100];
  char preText[100], opText[100], wordText[100], checkSeq[100], sameSeq[100];
  char lenText[100], specText[100], codeText[100], capText[100];
  char c;
  char blobText[8];
  char specSeq[20][20]; //Scope for change
  int incremPLen, seqLen, typeFixed;
  int wordLen;
  long int wordPos;
  int specNum;
  int capType;
  int encryptFlag;
  int i,j,k,l,m,n,p;
  float wordIndicator = 1.0, globCount, percCompleted;
  if(argc < 2)
  {
    printf("\nFile Name Expected: Not Found! ");
    exit(1);
  }
  strcpy(specSeq[0], "+--");
  noOfSpecSeqs = 1;
  strcpy(fileStart[0], "static/uc");
  strcpy(fileStart[1], "static/vc");
  strcpy(fileStart[2], "static/c");
  strcpy(fileStart[3], "static/he");
  strcpy(fileStart[4], "dynamic/uf");

  strcpy(fileStart[5], "dynamic/vf");
  strcpy(fileStart[6], "dynamic/f");
  strcpy(fileStart[7], "dynamic/nf");
  ipStream = fopen(argv[1], "r");
  opStream = fopen("temp/output.txt", "w");
  percStream = fopen("temp/glob_cnt.txt", "r");
  if(!ipStream)
  {
    printf("\n There's been a problem opening the input file! Exiting...");
    exit(1);
  }
  if(!opStream)
  {
    printf("\n There's been a problem opening the output file! Exiting...");
    exit(1);
  }
  if(!percStream)
  {
    printf("\n There's been a problem opening the status file! Exiting...");
    exit(1);
  }
  c = fgetc(ipStream);
  fscanf(percStream,"%f",&globCount);
  fclose(percStream);
  while(!feof(ipStream))
  {
    percCompleted = (wordIndicator/(float)globCount);
    if(percCompleted != 1.00)
      printf("\b\b\b\b\b\b%.2f", percCompleted * 100);
    wordIndicator = wordIndicator+1.0;
    charCount = 0;
    while(!isAlNum(c) && !feof(ipStream) && charCount < SEQ_LEN_CHARS)
    {
      preText[charCount++] = c;
      c = fgetc(ipStream);
    }
    preText[charCount] = '\0';
    if(feof(ipStream) || charCount == SEQ_LEN_CHARS)
      charCount++;
    p = 0;
    while(p < charCount - 1)

    {
      incremPLen = 1;
      typeFixed = 0;
      if(isBlob(preText[p]))
      {
        fprintf(opStream, "11");
        strcpy(opText, toBinaryText(preText[p]+128, 8));
        fprintf(opStream, "%s", opText);
        fflush(opStream);
        typeFixed = 1;
      }
      else
      {
        for(i=0;i<noOfSpecSeqs;i++)
        {
          seqLen = strlen(specSeq[i]);
          if((charCount - 1) - p >= seqLen)
          {
            for(j=p;j<p+seqLen;j++)
            {
              checkSeq[j-p] = preText[j];
            }
            checkSeq[j-p] = '\0';
            if(strcmp(checkSeq, specSeq[i]) == 0)
            {
              fprintf(opStream, "1010");
              strcpy(opText, toBinaryText(i, 6));
              fprintf(opStream, "%s", opText);
              fflush(opStream);
              incremPLen = seqLen;
              typeFixed = 1;
              break;
            }
          }
        }
        if(typeFixed == 0)
        {
          for(j=p; j < charCount-1 && (j-p) < SEQ_LEN_CHARS; j++)
          {
            if(preText[j] == preText[p])
            {

                        sameSeq[j-p] = preText[p];   /* index j-p runs from 0 upwards */
                    }
                    else
                        break;
                }
                sameSeq[j-p] = '\0';
                seqLen = j-p;
                if(seqLen > 2)
                {
                    /* Run of one repeated character: 1001 + 6-bit length + 8-bit character */
                    fprintf(opStream, "1001");
                    strcpy(opText, toBinaryText(seqLen, 6));
                    fprintf(opStream, "%s", opText);
                    strcpy(opText, toBinaryText(preText[p]+128, 8));
                    fprintf(opStream, "%s", opText);   // Scope for change
                    fflush(opStream);
                    incremPLen = seqLen;
                    typeFixed = 1;
                }
                if(typeFixed == 0)
                {
                    /* Single character: 1000 + 8-bit character */
                    fprintf(opStream, "1000");
                    strcpy(opText, toBinaryText(preText[p]+128, 8));
                    fprintf(opStream, "%s", opText);   // Scope for change
                    fflush(opStream);
                    typeFixed = 1;
                }
            }
        }
        p += incremPLen;
    }
    if(charCount == 0)
    {
        specNum = 14;   //Start of file or long letter sequence
    }
    else
    {

if(isBlob(preText[charCount-1])) { fprintf(opStream, "11"); strcpy(opText, toBinaryText(preText[charCount-1]+128, 8)); fprintf(opStream, "%s", opText); fflush(opStream); specNum = 13; } else { switch(preText[charCount-1]) { case ' ': specNum = 0; break; case '.': specNum = 1; break; case ',': specNum = 2; break; case ';': specNum = 3; break; case '\'': specNum = 4; break; case '\"': specNum = 5; break; case '(': specNum = 6; break; case '*': specNum = 7; break; case '-': specNum = 8; break; case '_': specNum = 9; break; case '$': specNum = 10; break; case ':': specNum = 11; break; case '[': specNum = 12; break; default: { specNum = 15; } } if(specNum != 15) { } } } if(specNum == 15) { strcpy(specText, "1111"); strcpy(specTemp, toBinaryText(preText[charCount-1]+128, 8)); strcat(specText, specTemp); } else

    {
        strcpy(specText, toBinaryText(specNum, 4));
    }
    lettCount = 0;
    while(isAlNum(c) && !feof(ipStream) && lettCount < MAX_WORD_LEN)
    {
        wordText[lettCount++] = c;
        c = fgetc(ipStream);
    }
    wordText[lettCount] = '\0';
    if(lettCount == 0)
    {
        //i.e. If there is no alphabet or letter, esp. when the special symbol
        //sequence is more than the expected length (64 chars)
        continue;
    }
    //Searching the various dictionaries begins right here.
    wordLen = strlen(wordText);
    l = 0;
    encryptFlag = 0;
    do
    {
        if(l==0)
        {
            strcpy(fileName, "static/uc.txt");
        }
        else if(l == 4)
        {
            strcpy(fileName, "dynamic/uf.txt");
        }
        else
        {
            strcpy(lenString, toString(wordLen));
            switch(l)
            {
                case 1: strcpy(fileName, "static/vc"); break;
                case 2: strcpy(fileName, "static/c"); break;
                case 3: strcpy(fileName, "static/he"); break;
                case 5: strcpy(fileName, "dynamic/vf"); break;
                case 6: strcpy(fileName, "dynamic/f"); break;
                case 7: strcpy(fileName, "dynamic/nf"); break;

} strcat(fileName, lenString); strcat(fileName, ".txt"); } if(l==0 || l==4) { wordPos = seekWord(fileName, wordText, 0); } else wordPos = seekWord(fileName, wordText, 1); if(wordPos != -999) { if(l < 4) { fprintf(opStream, "00"); strcpy(opText, toBinaryText(l, 2)); fprintf(opStream, "%s", opText); fflush(opStream); } else { fprintf(opStream, "01"); strcpy(opText, toBinaryText(l-4, 2)); fprintf(opStream, "%s", opText); fflush(opStream); } if(l != 0 && l != 4) { if(isSmall(wordText[0])) { capType = 0; for(m=1; m < wordLen; m++) { if(!isSmall(wordText[m])) { capType = 3; break; } } } else {

if(isSmall(wordText[1])) { capType = 2; for(m=1; m < wordLen; m++) { if(!isSmall(wordText[m])) { capType = 3; break; } } } else { capType = 1; for(m=1; m < wordLen; m++) { if(!isBig(wordText[m])) { capType = 3; break; } } } } if(capType == 3) { capText[0] = '1'; capText[1] = '1'; n=0; while(n < wordLen) { if(isSmall(wordText[n])) capText[n+2] = '0'; else capText[n+2] = '1'; n++;

} capText[n+2] = '\0'; } else { strcpy(capText, toBinaryText(capType, 2)); } } switch(l) { case 0: { strcpy(opText, toBinaryText(wordPos, 8)); fprintf(opStream, "%s", specText); fprintf(opStream, "%s", opText); fflush(opStream); break; } case 1: { strcpy(lenText, toBinaryText(wordLen, 4)); strcpy(codeText, toBinaryText(wordPos, 10)); break; } case 2: { strcpy(lenText, toBinaryText(wordLen, 5)); strcpy(codeText, toBinaryText(wordPos, 12)); break; } case 3: { strcpy(lenText, toBinaryText(wordLen, 6)); strcpy(codeText, toBinaryText(wordPos, 14)); break; } case 4: { strcpy(opText, toBinaryText(wordPos, 8)); fprintf(opStream, "%s", specText); fprintf(opStream, "%s", opText); fflush(opStream);

break; } case 5: { strcpy(lenText, toBinaryText(wordLen, 4)); strcpy(codeText, toBinaryText(wordPos, 10)); break; } case 6: { strcpy(lenText, toBinaryText(wordLen, 5)); strcpy(codeText, toBinaryText(wordPos, 12)); break; } case 7: { strcpy(lenText, toBinaryText(wordLen, 6)); strcpy(codeText, toBinaryText(wordPos, 14)); break; } } if(l != 0 && l != 4) { fprintf(opStream, "%s", lenText); fprintf(opStream, "%s", capText); fprintf(opStream, "%s", specText); fprintf(opStream, "%s", codeText); fflush(opStream); } encryptFlag = 1; break; } else { l++; } }while(l<8); } fclose(ipStream); fclose(opStream); return 0; }

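/* Returns 1 if c is an ASCII letter or digit, i.e. part of a word */
int isAlNum(char c)
{
    if((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
        return 1;
    return 0;
}

/* Returns 1 for characters written with the two-bit "11" prefix:
   negative char values (bytes above 127 where char is signed), BEL (7) and DEL (127) */
int isBlob(char c)
{
    if(c < 0 || c == 7 || c == 127)
        return 1;
    return 0;
}

/* Returns 1 for a lowercase letter or a digit */
int isSmall(char c)
{
    if((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
        return 1;
    return 0;
}

/* Returns 1 for an uppercase letter or a digit */
int isBig(char c)
{
    if((c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
        return 1;
    return 0;
}

/* Writes num as a binary string of exactly limit digits, zero-padded on the
   left, into the global binText; the next call overwrites the result */
char *toBinaryText(int num, int limit)
{
    int i, dig;
    for(i=0;i<limit;i++)
    {
        binText[i] = '0';
    }
    binText[i] = '\0';
    i = limit - 1;
    while(num > 0)
    {
        dig = num % 2;
        binText[i--] = dig + 48;
        num = num / 2;
    }
    return binText;
}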

long int seekWord(char *file, char *query, int mode) { long int finalPos = -999; FILE *searchStream; char wordie[100]; int i; long int count = 0; char e; searchStream = fopen(file, "r"); if(!searchStream) { fclose(ipStream); fclose(opStream); exit(1); } e = fgetc(searchStream); while(!feof(searchStream)) { if(e == '#') { i=0; e = fgetc(searchStream); while(e!='#' && !feof(searchStream)) { wordie[i++] = e; e = fgetc(searchStream); } wordie[i] = '\0'; count++; if((mode == 0 && strcmp(wordie, query) == 0) || (mode == 1 && h_stricmp(wordie, query) == 0)) { finalPos = count - 1; //Makes the file index zero-based break; } } }

    fclose(searchStream);
    return finalPos;
}

/* Converts a positive integer to its decimal string (used to build dictionary
   file names). rev starts at 5 as a sentinel digit so that trailing zeros of
   num survive the digit reversal. */
char *toString(int num)
{
    int rev, dig;
    int i;
    rev = 5;
    while(num > 0)
    {
        dig = num % 10;
        rev = (rev * 10) + dig;
        num = num/10;
    }
    i=0;
    while(rev > 5)
    {
        dig = rev % 10;
        dig = dig + 48;
        outString[i++] = dig;
        rev = rev/10;
    }
    outString[i] = '\0';
    return outString;
}
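The capitalisation test inside the encoder's main loop can be read as a small classifier. The sketch below is one possible way to isolate it, not the project's own code: the function name classifyCaps and the CAP_* constants are hypothetical, while the codes 0 to 3 follow the capType values used in the listing (0 all lowercase, 1 all uppercase, 2 first letter uppercase, 3 mixed). Digits count as both lowercase and uppercase, as in isSmall and isBig.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical names; the numeric codes mirror capType in the encoder listing. */
enum { CAP_LOWER = 0, CAP_UPPER = 1, CAP_FIRST = 2, CAP_MIXED = 3 };

static int isLowerOrDigit(char c)
{
    return islower((unsigned char)c) || isdigit((unsigned char)c);
}

static int isUpperOrDigit(char c)
{
    return isupper((unsigned char)c) || isdigit((unsigned char)c);
}

/* Classifies a non-empty alphanumeric word by its capitalisation pattern. */
int classifyCaps(const char *word)
{
    size_t i;
    int type;

    assert(word != NULL && word[0] != '\0');

    /* The first one or two characters pick the candidate pattern. */
    if (isLowerOrDigit(word[0]))
        type = CAP_LOWER;
    else if (isLowerOrDigit(word[1]))
        type = CAP_FIRST;
    else
        type = CAP_UPPER;

    /* Every remaining character must agree with it, otherwise the word is mixed. */
    for (i = 1; word[i] != '\0'; i++) {
        int ok = (type == CAP_UPPER) ? isUpperOrDigit(word[i])
                                     : isLowerOrDigit(word[i]);
        if (!ok)
            return CAP_MIXED;
    }
    return type;
}
```

For a mixed word the encoder follows the two-bit code with one extra bit per letter, so classifyCaps("heLLo") would lead to the bits 11 followed by 00110.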

7.2.3 COMPRESS
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *getNextByte(FILE *);
long int toDecimalNum(char *);

long int raiseTo(int, int); char sequence[100]; int main(int argc, char *argv[]) { FILE *asciiStream, *hexStream, *extraStream, *noByteStream; char byteText[10]; char extraBits = 0; char outName[100]; int decNum; int i; strcpy(outName, argv[1]); outName[strlen(outName)-4] = '\0'; strcat(outName, ".jang"); asciiStream = fopen("temp/output.txt", "r"); hexStream = fopen(outName, "wb"); extraStream = fopen("temp/extrabits.txt", "w"); if(!asciiStream) { exit(1); } if(!hexStream) { exit(1); } if(!extraStream) { exit(1); } while(!feof(asciiStream)) { strcpy(byteText, getNextByte(asciiStream)); if(strlen(byteText) < 8) { extraBits = 8 - strlen(byteText); for(i=strlen(byteText); i<8; i++) { byteText[i] = '0'; } byteText[i] = '\0'; } decNum = toDecimalNum(byteText);

decNum -= 128; fprintf(hexStream, "%c", decNum); fflush(hexStream); } fclose(asciiStream); fclose(hexStream); fprintf(extraStream, "%d", extraBits); fclose(extraStream); return 0; } char *getNextByte(FILE *fp) { char c; int i; i=0; while(i<8) { c = fgetc(fp); if(feof(fp)) { break; } sequence[i++] = c; } sequence[i] = '\0'; return sequence; } long int toDecimalNum(char *binText) { int binLen, i; long int decAns; binLen = strlen(binText); decAns = 0; for(i=binLen-1;i>=0;i--) { if(binText[i]-48 == 1) { decAns = decAns + raiseTo(2, (binLen - 1 - i));

        }
    }
    return decAns;
}

/* Returns num raised to the power exp (exp >= 0) by repeated multiplication */
long int raiseTo(int num, int exp)
{
    int i;
    long int ans;
    ans = 1;
    for(i=1;i<=exp;i++)
    {
        ans = ans * num;
    }
    return ans;
}
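The COMPRESS stage turns the encoder's text of '0' and '1' characters into real bytes, eight characters at a time, and records how many padding bits were appended to the last byte. A minimal sketch of the same idea using bit shifts instead of raiseTo is given below. The name packBits and its parameters are assumptions made for illustration. Unlike the listing, this version does not subtract 128 from each byte and reports zero padding when the input is a whole number of bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Packs a string of '0'/'1' characters into bytes, most significant bit first.
   Returns the number of bytes written to out. *padBits receives the number of
   zero bits appended to fill the last byte (0 to 7). Hypothetical helper. */
size_t packBits(const char *bits, unsigned char *out, size_t outSize, int *padBits)
{
    size_t n = strlen(bits);
    size_t nBytes = (n + 7) / 8;   /* round up to whole bytes */
    size_t i;

    assert(nBytes <= outSize);
    memset(out, 0, nBytes);

    for (i = 0; i < n; i++) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1')
            out[i / 8] |= (unsigned char)(0x80u >> (i % 8));
    }

    *padBits = (int)(nBytes * 8 - n);
    return nBytes;
}
```

With the ten-bit input "0100000101" this produces the two bytes 0x41 and 0x40 and six padding bits, which is the value the listing would store in temp/extrabits.txt.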

7.2.4 DECOMPRESS
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *toBinaryText(int num, int limit);

char binText[100];

int main(int argc, char *argv[])
{
    FILE *asciiStream, *hexStream, *extraStream;
    char byteText[10], prevByte[10];
    int extraBits, lastBit;
    char c;
    int d;
    int decNum;
    int i;

    /* Check the argument count before argv[1] is used */
    if(argc < 2)
    {
        printf("\n\nJANG File Name Expected! Yours sucks!");
        exit(1);
    }
    asciiStream = fopen("temp/output2.txt", "w");
    hexStream = fopen(argv[1], "rb");
    extraStream = fopen("temp/extrabits.txt", "r");
    if(!asciiStream)

{ printf("\nError! Can't open file!"); exit(1); } if(!hexStream) { printf("\nError! Can't open file!"); exit(1); } if(!extraStream) { printf("\nError! Can't open file!"); exit(1); } fscanf(extraStream, "%d", &extraBits); if(extraBits < 8) lastBit = 8 - extraBits; else lastBit = 8; fclose(extraStream); strcpy(prevByte, "\0"); strcpy(byteText, "\0"); while(!feof(hexStream)) { c = fgetc(hexStream); d = c + 128; strcpy(prevByte, byteText); if(feof(hexStream)) { prevByte[lastBit] = '\0'; } else { strcpy(byteText, toBinaryText(d, 8)); } fprintf(asciiStream, "%s", prevByte);

        fflush(asciiStream);
    }
    fclose(hexStream);
    fclose(asciiStream);
    return 0;
}

/* Writes num as a binary string of exactly limit digits, zero-padded on the
   left, into the global binText; the next call overwrites the result */
char *toBinaryText(int num, int limit)
{
    int i, dig;
    for(i=0;i<limit;i++)
    {
        binText[i] = '0';
    }
    binText[i] = '\0';
    i = limit - 1;
    while(num > 0)
    {
        dig = num % 2;
        binText[i--] = dig + 48;
        num = num / 2;
    }
    return binText;
}
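The DECOMPRESS stage reverses the packing: each byte is expanded back into eight '0'/'1' characters and the padding bits recorded at compression time are dropped from the end. The following sketch shows that step on an in-memory buffer. The function unpackBits and its parameters are hypothetical, and the sketch assumes bytes stored without the 128 offset that the listing applies.

```c
#include <assert.h>
#include <stddef.h>

/* Expands nBytes bytes into a string of '0'/'1' characters, most significant
   bit first, leaving out the last padBits bits. Returns the number of bit
   characters written (excluding the terminating '\0'). Hypothetical helper. */
size_t unpackBits(const unsigned char *in, size_t nBytes, int padBits,
                  char *out, size_t outSize)
{
    size_t total, i;

    assert(padBits >= 0 && padBits < 8);
    assert(nBytes > 0 || padBits == 0);

    total = nBytes * 8 - (size_t)padBits;
    assert(total + 1 <= outSize);   /* room for the bits and the terminator */

    for (i = 0; i < total; i++)
        out[i] = (in[i / 8] & (0x80u >> (i % 8))) ? '1' : '0';
    out[total] = '\0';
    return total;
}
```

Given the bytes 0x41 and 0x40 with six padding bits, the result is the ten-character string "0100000101", which the DECODE stage can then read field by field.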

7.2.5 DECODE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *getSequence(FILE *, int);
long int toDecimalNum(char *);
long int raiseTo(int num, int exp);
char *getWord(char *, long int);
char *toString(int);

FILE *textStream, *binStream;
char outString[100];
char foundWord[100];
char sequence[100];

int main(int argc, char *argv[])
{

    char typeBits[100], comnessBits[100], freqBits[100], specTypeBits[100], positBits[100],
         lengthBits[100], capTypeBits[100], prevCharBits[100], prevCharCodeBits[100],
         singCharBits[100], seqLenBits[100], specSeqBits[100], blobBits[100];
    int typeNum, comnessNum, freqNum, specTypeNum, positNum, lengthNum, capTypeNum,
        prevCharNum, seqLenNum, specSeqNum;
    char capArray[100];
    char prevChar[10], singChar[10], blob[10];
    long int prevCharCode, singCharCode, blobCode;
    int i,j,k;
    char lenString[10];
    char fileStart[100][100];
    char specSeq[20][20];
    char fileName[100];
    int codeSpecBits, lenSpecBits;
    int noOfSpecSeqs;
    char printWord[100];
    char specCharArray[] = " .,;\'\"(*-_$:[";

    /* Check the argument count before argv[1] is used */
    if(argc < 2)
    {
        printf("\nOutput File Name Required! Not Found! ");
        exit(1);
    }
    strcpy(specSeq[0], "+--");
    noOfSpecSeqs = 1;
    strcpy(fileStart[0], "static/uc");
    strcpy(fileStart[1], "static/vc");
    strcpy(fileStart[2], "static/c");
    strcpy(fileStart[3], "static/he");
    strcpy(fileStart[4], "dynamic/uf");
    strcpy(fileStart[5], "dynamic/vf");
    strcpy(fileStart[6], "dynamic/f");
    strcpy(fileStart[7], "dynamic/nf");
    binStream = fopen("temp/output2.txt", "r");
    textStream = fopen(argv[1], "w");
    if(!binStream)
    {
        exit(1);
    }
    if(!textStream)
    {
        printf("\nOutput file could not be opened! ");

exit(1); } while(!feof(binStream)) { strcpy(typeBits, getSequence(binStream, 2)); typeNum = toDecimalNum(typeBits); switch(typeNum) { case 0: { strcpy(comnessBits, getSequence(binStream, 2)); comnessNum = toDecimalNum(comnessBits); switch(comnessNum) { case 0: { codeSpecBits = 8; strcpy(fileName, "static/uc.txt"); break; } case 1: { lenSpecBits = 4; codeSpecBits = 10; break; } case 2: { lenSpecBits = 5; codeSpecBits = 12; break; } case 3: { lenSpecBits = 6; codeSpecBits = 14; break; } } if(comnessNum != 0) {

strcpy(lengthBits, getSequence(binStream, lenSpecBits)); lengthNum = toDecimalNum(lengthBits); strcpy(lenString, toString(lengthNum)); strcpy(fileName, fileStart[comnessNum]); strcat(fileName, lenString); strcat(fileName, ".txt"); strcpy(capTypeBits, getSequence(binStream, 2)); capTypeNum = toDecimalNum(capTypeBits); switch(capTypeNum) { case 0: { for(j=0;j<lengthNum;j++) capArray[j] = '0'; break; } case 2: { for(j=0;j<lengthNum;j++) { if(j == 0) capArray[j] = '1'; else capArray[j] = '0'; } break; } case 1: { for(j=0;j<lengthNum;j++) capArray[j] = '1'; break; } case 3: { strcpy(capArray, getSequence(binStream, lengthNum)); break;

} } } strcpy(prevCharBits, getSequence(binStream, 4)); prevCharNum = toDecimalNum(prevCharBits); if(prevCharNum < 13) { prevChar[0] = specCharArray[prevCharNum]; prevChar[1] = '\0'; } else if(prevCharNum == 13) { prevChar[0] = '\0'; } else if(prevCharNum == 14) { prevChar[0] = '\0'; } else { strcpy(prevCharCodeBits, getSequence(binStream, 8)); prevCharCode = toDecimalNum(prevCharCodeBits); prevCharCode -= 128; prevChar[0] = prevCharCode; prevChar[1] = '\0'; } fprintf(textStream, "%s", prevChar); fflush(textStream); strcpy(positBits, getSequence(binStream, codeSpecBits)); positNum = toDecimalNum(positBits); strcpy(printWord, getWord(fileName, positNum+1)); // +1 since the dictionaries are zero-based if(comnessNum != 0) { for(j=0;j<lengthNum;j++) { if(capArray[j] == '1') { if(!(printWord[j] >= '0' && printWord[j] <= '9'))

printWord[j] -= 32; } } } fprintf(textStream, "%s", printWord); fflush(textStream); break; } case 1: { strcpy(freqBits, getSequence(binStream, 2)); freqNum = toDecimalNum(freqBits); switch(freqNum) { case 0: { codeSpecBits = 8; strcpy(fileName, "dynamic/uf.txt"); break; } case 1: { lenSpecBits = 4; codeSpecBits = 10; break; } case 2: { lenSpecBits = 5; codeSpecBits = 12; break; } case 3: { lenSpecBits = 6; codeSpecBits = 14; break; } } if(freqNum != 0)

{ strcpy(lengthBits, getSequence(binStream, lenSpecBits)); lengthNum = toDecimalNum(lengthBits); strcpy(lenString, toString(lengthNum)); strcpy(fileName, fileStart[freqNum + 4]); strcat(fileName, lenString); strcat(fileName, ".txt"); strcpy(capTypeBits, getSequence(binStream, 2)); capTypeNum = toDecimalNum(capTypeBits); switch(capTypeNum) { case 0: { for(j=0;j<lengthNum;j++) capArray[j] = '0'; break; } case 2: { for(j=0;j<lengthNum;j++) { if(j == 0) capArray[j] = '1'; else capArray[j] = '0'; } break; } case 1: { for(j=0;j<lengthNum;j++) capArray[j] = '1'; break; } case 3: { strcpy(capArray, getSequence(binStream, lengthNum)); break; }

} } strcpy(prevCharBits, getSequence(binStream, 4)); prevCharNum = toDecimalNum(prevCharBits); if(prevCharNum < 13) { prevChar[0] = specCharArray[prevCharNum]; prevChar[1] = '\0'; } else if(prevCharNum == 13) { prevChar[0] = '\0'; } else if(prevCharNum == 14) { prevChar[0] = '\0'; } else { strcpy(prevCharBits, getSequence(binStream, 8)); prevCharCode = toDecimalNum(prevCharBits); prevCharCode -= 128; prevChar[0] = prevCharCode; prevChar[1] = '\0'; } fprintf(textStream, "%s", prevChar); fflush(textStream); strcpy(positBits, getSequence(binStream, codeSpecBits)); positNum = toDecimalNum(positBits); strcpy(printWord, getWord(fileName, positNum+1)); // +1 since the dictionaries are zero-based if(freqNum != 0) { for(j=0;j<lengthNum;j++) { if(capArray[j] == '1') { if(!(printWord[j] >= '0' && printWord[j] <= '9'))

printWord[j] -= 32; } } } fprintf(textStream, "%s", printWord); break; } case 2: { strcpy(specTypeBits, getSequence(binStream, 2)); specTypeNum = toDecimalNum(specTypeBits); switch(specTypeNum) { case 0: { strcpy(singCharBits, getSequence(binStream, 8)); singCharCode = toDecimalNum(singCharBits); singCharCode -= 128; singChar[0] = singCharCode; singChar[1] = '\0'; fprintf(textStream, "%s", singChar); fflush(textStream); break; } case 1: { strcpy(seqLenBits, getSequence(binStream, 6)); seqLenNum = toDecimalNum(seqLenBits); strcpy(singCharBits, getSequence(binStream, 8)); singCharCode = toDecimalNum(singCharBits); singCharCode -= 128; singChar[0] = singCharCode; singChar[1] = '\0'; for(j=1;j<=seqLenNum;j++) { fprintf(textStream, "%s", singChar); fflush(textStream);

} break; } case 2: { strcpy(specSeqBits, getSequence(binStream, 6)); specSeqNum = toDecimalNum(specSeqBits); fprintf(textStream, "%s", specSeq[specSeqNum]); fflush(textStream); break; } case 3: { break; } } break; } case 3: { strcpy(blobBits, getSequence(binStream, 8)); blobCode = toDecimalNum(blobBits); blobCode -= 128; blob[0] = blobCode; blob[1] = '\0'; fprintf(textStream, "%s", blob); fflush(textStream); break; } } } fclose(binStream); fclose(textStream); return 0;

} char *getSequence(FILE *fp, int length) { char c; int i; i=0; while(i<length) { c = fgetc(fp); if(feof(fp)) { exit(1); } sequence[i++] = c; } sequence[i] = '\0'; return sequence; } long int toDecimalNum(char *binText) { int binLen, i; long int decAns; binLen = strlen(binText); decAns = 0; for(i=binLen-1;i>=0;i--) { if(binText[i]-48 == 1) { decAns = decAns + raiseTo(2, (binLen - 1 - i)); } } return decAns; } long int raiseTo(int num, int exp) { int i; long int ans;

ans = 1; for(i=1;i<=exp;i++) { ans = ans * num; } return ans; } char *getWord(char *file, long int pos) { FILE *fp; char d, e; int i; int wordFound = 0; long int hashCount = 0; fp = fopen(file, "r"); if(!fp) { fclose(textStream); fclose(binStream); exit(1); } while(!feof(fp)) { d = fgetc(fp); if(d=='#') { hashCount++; if(hashCount == pos) { wordFound = 1; break; } } } if(wordFound == 1) { i=0; while(!feof(fp)) { e = fgetc(fp);

            if(e=='#')
                break;
            foundWord[i++] = e;
        }
        foundWord[i] = '\0';
    }
    fclose(fp);
    return foundWord;
}

/* Converts a positive integer to its decimal string; rev starts at 5 as a
   sentinel digit so that trailing zeros of num survive the digit reversal */
char *toString(int num)
{
    int rev, dig;
    int i;
    rev = 5;
    while(num > 0)
    {
        dig = num % 10;
        rev = (rev * 10) + dig;
        num = num/10;
    }
    i=0;
    while(rev > 5)
    {
        dig = rev % 10;
        dig = dig + 48;
        outString[i++] = dig;
        rev = rev/10;
    }
    outString[i] = '\0';
    return outString;
}
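Both seekWord in the encoder and getWord in the decoder walk a dictionary file in which every word is preceded by a '#' character, so the layout appears to be "#word1#word2#...#". That layout is inferred from the parsing code and is an assumption here. The sketch below performs the same positional lookup on an in-memory string; lookupWord and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the entry with the given zero-based index from a '#'-delimited
   dictionary string into out. Returns 1 on success, 0 if there is no such
   entry or it does not fit in out. Hypothetical helper. */
int lookupWord(const char *dict, long index, char *out, size_t outSize)
{
    long seen = 0;      /* number of '#' separators passed so far */
    size_t len = 0;

    assert(dict != NULL && out != NULL && outSize > 0);

    /* Skip to just past the (index + 1)-th '#'. */
    while (*dict != '\0') {
        if (*dict++ == '#' && ++seen == index + 1)
            break;
    }
    if (seen != index + 1 || *dict == '\0')
        return 0;

    /* Copy characters up to the next separator or the end of the string. */
    while (dict[len] != '\0' && dict[len] != '#') {
        if (len + 1 >= outSize)
            return 0;
        out[len] = dict[len];
        len++;
    }
    out[len] = '\0';
    return 1;
}
```

For the dictionary string "#the#of#and#", index 1 yields "of". The decoder reaches the same entry by calling getWord with positNum+1, because it counts separators starting from one.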

7.2.6 SIZECHECK
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

long int getFileSize(FILE *);
char *toString(int);

char outString[100];
char inputFile[100];
FILE *sizeStream;

int main(int argc, char *argv[])
{
    int i;
    char fileName[100];
    long int totalSize, inputSize;
    float percCompressn;

    if(argc==1)
    {
        puts("Filename Expected: Not Found");
        exit(1);
    }
    totalSize = 0;
    sizeStream = fopen("dynamic/uf.txt", "r");
    totalSize += getFileSize(sizeStream);
    fclose(sizeStream);
    for(i=1;i<=63;i++)
    {
        /* Rebuild the name on every pass; appending to the previous name
           would give "dynamic/vf1.txt2.txt" on the second iteration */
        strcpy(fileName, "dynamic/vf");
        strcat(fileName, toString(i));
        strcat(fileName, ".txt");
        sizeStream = fopen(fileName, "r");
        if(!sizeStream)
        {
            continue;   /* no dictionary for this word length */
        }
        totalSize += getFileSize(sizeStream);
        fclose(sizeStream);
    }
    for(i=1;i<=63;i++)
    {
        strcpy(fileName, "dynamic/f");
        strcat(fileName, toString(i));
        strcat(fileName, ".txt");
        sizeStream = fopen(fileName, "r");
        if(!sizeStream)
        {
            continue;   /* no dictionary for this word length */

        }
        totalSize += getFileSize(sizeStream);
        fclose(sizeStream);
    }
    for(i=1;i<=63;i++)
    {
        /* Rebuild the name on every pass so that suffixes do not accumulate */
        strcpy(fileName, "dynamic/nf");
        strcat(fileName, toString(i));
        strcat(fileName, ".txt");
        sizeStream = fopen(fileName, "r");
        if(!sizeStream)
        {
            continue;   /* no dictionary for this word length */
        }
        totalSize += getFileSize(sizeStream);
        fclose(sizeStream);
    }
    /* The compressed file has the input name with its extension replaced by .jang */
    strcpy(inputFile,argv[1]);
    inputFile[strlen(inputFile)-4]='\0';
    strcat(inputFile,".jang");
    sizeStream = fopen(inputFile, "r");
    totalSize += getFileSize(sizeStream);
    fclose(sizeStream);
    sizeStream = fopen(argv[1], "r");
    inputSize = getFileSize(sizeStream);
    fclose(sizeStream);
    percCompressn = (float)(inputSize - totalSize)/inputSize * 100;
    printf("========================================");
    printf("\nCompression Stats:\n");
    printf("========================================\n\n");
    printf("Input File Size: %ld Bytes\n", inputSize);
    printf("Output Size: %ld Bytes\n", totalSize);
    printf("Compression: %.2f percent\n", percCompressn);
    printf("========================================\n\n");
    return 0;
}

char *toString(int num)
{
    int rev, dig;
    int i;
    rev = 5;
    while(num > 0)

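    {
        dig = num % 10;
        rev = (rev * 10) + dig;
        num = num/10;
    }
    i=0;
    while(rev > 5)
    {
        dig = rev % 10;
        dig = dig + 48;
        outString[i++] = dig;
        rev = rev/10;
    }
    outString[i] = '\0';
    return outString;
}

/* Counts the bytes in stream. fgetc returns an int so that EOF can be told
   apart from a valid byte; testing the return value directly keeps the
   end-of-file marker out of the count. */
long int getFileSize(FILE *stream)
{
    long int characCount=0;
    int charac;
    while((charac = fgetc(stream)) != EOF)
    {
        characCount++;
    }
    return (characCount);
}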
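The figure printed by SIZECHECK is the saving relative to the input size, where the output size is the sum of the .jang file and all dynamic dictionaries. A small sketch of that calculation follows. The function name compressionPercent is hypothetical, and the guard for an empty input file is an addition that the listing does not have.

```c
#include <assert.h>

/* Returns the space saved as a percentage of the input size.
   A negative result means the output is larger than the input.
   Hypothetical helper; an empty input is reported as 0 percent. */
double compressionPercent(long inputSize, long outputSize)
{
    assert(inputSize >= 0 && outputSize >= 0);
    if (inputSize == 0)
        return 0.0;   /* avoid dividing by zero */
    return (double)(inputSize - outputSize) / (double)inputSize * 100.0;
}
```

An input of 1000 bytes that compresses to 600 bytes in total gives a value of about 40 percent, while an output of 150 bytes for a 100-byte input gives about -50 percent. The negative case can occur for short files, where the dictionaries outweigh the savings.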

7.3 SHELL SCRIPTS

7.3.1 jang.sh
#!/bin/sh
#*********************************************************************************
#Script Name                 : JANG.sh
#Author                      : JANG (TM)
#Optional Input Parameters
#        -help   help
#        -ver    display version details
#        -cre    display credits

# read command line arguments
#set -x
echo $2
cp -f $2 /tmp/$2
if [ ! -e "/tmp/$2" ]
then
    echo "File Not Copied"
fi
if [ $# -eq 0 ]
then
    echo "Arguments Not Found"
    echo "Try './JANG.sh --help' for More Options"
fi
if [ "$1" = "--help" ]
then
    echo "Help Required"
    cat "/usr/bin/JANGFILES/help.txt"
elif [ "$1" = "--credits" ]
then
    echo "Credits"
    cat "/usr/bin/JANGFILES/credits.txt"
elif [ "$1" = "--info" ]
then
    echo "Version Information"
    cat "/usr/bin/JANGFILES/version.txt"
elif [ "$1" = "-c" ]
then
    if [ $# -ne 2 ]
    then
        echo "Argument Expected : Not Found"
        exit 1
    elif [ ! -e $2 ]
    then
        echo "File '$2' NOT FOUND"
        exit 1
    else
        /usr/bin/compressfile.sh $2
    fi

elif [ "$1" = "-d" ]
then
    if [ $# -ne 3 ]
    then
        echo "Arguments Expected : Not Found"
        exit 1
    elif [ ! -e $2 ]
    then
        echo "File '$2' NOT FOUND"
        exit 1
    else
        /usr/bin/decompressfile.sh $2 $3
    fi
fi

7.3.2 compressfile.sh
#!/bin/sh
echo ""
echo "========================================"
echo "JANG Compressor"
echo "========================================"
echo ""
echo "----------------------------------------"
echo "Phase 1"
echo "----------------------------------------"
echo "Initializing Files and Folders"
rm -rf /tmp/dynamic
#rm -rf logs
rm -rf /tmp/temp
mkdir /tmp/dynamic
#mkdir logs
mkdir /tmp/temp
echo "Initialization Procedures Complete"
echo "----------------------------------------"
echo "Phase 2"
echo "----------------------------------------"
echo "Generating Dictionaries"
/usr/bin/jangexec/1-dic_release $1
echo ""

echo "Dictionaries Generated"
echo "----------------------------------------"
echo "Phase 3"
echo "----------------------------------------"
echo "Encoding the File"
/usr/bin/jangexec/2-encode $1
echo ""
echo "File Encoded"
echo "----------------------------------------"
echo "Phase 4"
echo "----------------------------------------"
echo "Compressing File"
/usr/bin/jangexec/3-compress $1
echo "100"
echo "Compression Complete"
echo "----------------------------------------"
echo "Phase 5"
echo "----------------------------------------"
echo "Final Exercise"
# Base name of the input file (text before the first '.')
PRINAM=`echo $1 | cut -d'.' -f 1`
rm -rf /tmp/$PRINAM".dout"
mkdir /tmp/$PRINAM".dout"
mkdir /tmp/$PRINAM".dout"/dynamic
cp -r /tmp/dynamic/ /tmp/$PRINAM".dout"/
cp /tmp/$PRINAM".jang" /tmp/$PRINAM".dout"
echo "----------------------------------------"
/usr/bin/jangexec/sizecheck $1
PWD=`pwd`
cp -f /tmp/temp/extrabits.txt $PWD/$PRINAM.dout/extrabits.txt
cp -rf "/tmp/$PRINAM.dout/" $PWD/
rm -rf "/tmp/"$PRINAM".dout";
rm -rf "/tmp/"$PRINAM".jang";
rm -rf "/tmp/dynamic/"
rm -rf "/tmp/temp/"
rm -rf "/tmp/$1"


7.3.3 decompressfile.sh
#!/bin/sh
mkdir /tmp/temp
PWD=`pwd`
echo ""
echo "========================================"
echo "JANG Decompressor"
echo "========================================"
echo ""
echo "----------------------------------------"
echo "Phase 1"
echo "----------------------------------------"
echo "Verifying Arguments"
if [ $# -ne 2 ]
then
    echo "Invalid Arguments"
    echo "----------------------------------------"
    exit 1
fi
cp -rf $PWD/$1/ /tmp/
# Base name of the first argument (text before the first '.')
PRINAM=`echo $1 | cut -d'.' -f 1`
echo "Verification Complete"
echo "----------------------------------------"
echo "Phase 2"
echo "----------------------------------------"
echo "Decompressing..."
/usr/bin/jangexec/4-decompress $PRINAM".jang"
echo "100.00"
echo "Decompression Complete"
echo "----------------------------------------"
echo "Phase 3"
echo "----------------------------------------"
echo "Decoding..."
/usr/bin/jangexec/5-decode $1 $2
echo "Decoding Complete"
echo "========================================"

cp -rf /tmp/temp/$2 $PWD/
rm -rf /tmp/dynamic/
rm -rf /tmp/$1/
rm -rf /tmp/temp/

7.3.4 Screen Shots

COMPRESSION



DECOMPRESSION


COMPARISON
