
in search of a (decent) analyst programmer, project: The Unlimited Compressor

le Redoutable

I am El Glouglou :)
the goal is to turn a homogeneous, WinZip-style compressed file into a heterogeneous file that can then be zipped again with WinZip, even if the gain is as low as 1-2%;

what is a homogeneous file?
it is a file where every byte value occurs with a frequency of, say, 1/180 to 1/300 (the ideal would be 1/256 if the occurrences of each value were perfectly even)
what is a heterogeneous file?
it is a file where some values appear more often than others;
for example, text files consist mostly of values from 32 to 128 (or so)
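
To put a number on that, here is a minimal sketch that counts byte values and reports the average spacing of the most frequent one (the "1/180" kind of figure above). The function name `most_common_ratio` and the two sample inputs are made up for this illustration.

```python
from collections import Counter

def most_common_ratio(data: bytes) -> tuple[int, float]:
    """Return (most frequent byte value, average distance between its occurrences)."""
    if not data:
        raise ValueError("empty input")
    value, count = Counter(data).most_common(1)[0]
    return value, len(data) / count

# Every value exactly once: perfectly homogeneous, ratio 1/256.
flat = bytes(range(256))
# Plain ASCII text: a few values (here the space, 32) dominate.
text = b"the quick brown fox jumps over the lazy dog " * 10

print(most_common_ratio(flat))
print(most_common_ratio(text))
```

A result close to 256 for the second number means the file is homogeneous in the sense used here; a much smaller number means some value dominates.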

so, here's my method :

first, some statistics :
find the value with the most occurrences;
as I wrote above, the most frequent value will have a ratio of (for example) 1/180;
that means you can store, in a single byte, the offset from each occurrence of that value to the next one (statistics say the offsets should be around 180, so 255, the largest value a byte can hold, should rarely be exceeded);
still, sometimes you may end up with an offset of 280, 400, or even 850, so you can simply rule that a byte of 255 means the offset is 254 plus another byte of 0 to 254; if that next byte is again 255, the offset is 254 + 254 + another byte, and so on
the only problem is that too many extra offset bytes add to the length of the output file (well, starting with the most frequent value somewhat mitigates this problem)
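
The 255 rule can be sketched as follows. This is only one reading of the rule as stated (a byte of 255 contributes 254 and says "keep reading", any byte from 0 to 254 ends the offset); the names `encode_offset` and `decode_offset` are hypothetical.

```python
def encode_offset(offset: int) -> bytes:
    """Encode one offset; a byte of 255 means 'add 254 and keep reading'."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    out = bytearray()
    while offset > 254:
        out.append(255)   # escape byte: stands for 254
        offset -= 254
    out.append(offset)    # final byte in 0..254 ends the offset
    return bytes(out)

def decode_offset(stream) -> int:
    """Read one offset from an iterator of byte values."""
    total = 0
    for byte in stream:
        if byte != 255:
            return total + byte
        total += 254
    raise ValueError("stream ended in the middle of an offset")

for n in (180, 254, 255, 280, 400, 850):
    encoded = encode_offset(n)
    assert decode_offset(iter(encoded)) == n
    print(n, list(encoded))
```

With this reading, 280 becomes the two bytes 255, 26 and 850 becomes the four bytes 255, 255, 255, 88, which is the extra length mentioned above.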

ok.

here's the idea :
instead of byte values you store offsets for each byte value (going from the most frequent byte value down to the least frequent)
then, as you write each offset, you put a flag at the position in the original file where that occurrence was found;
then, on each later pass (there are 256 values from 0 to 255, so you do the job 256 times), a flagged position is skipped: it does not add to the offset for the n-th value
a quick example for a file of 20 bytes made up of 6 values (39, 44, 11, 18, 74, 78):
01 39
02 44
03 39
04 11
05 18
06 18
07 11
08 78
09 39
10 11
11 44
12 39
13 18
14 11
15 11
16 11
17 74
18 44
19 78
20 39

first, the statistics ( value, number of occurrences ) :
39 5
44 3
11 6
18 3
74 1
78 2

sorted by number of occurrences ( and written to the output file as rank, value ) :
1 11
2 39
3 44
4 18
5 78
6 74
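
The statistics and the sorted table can be reproduced with a short sketch. It assumes ties (44 and 18 both occur 3 times) are broken by first appearance in the file, which is what `Counter.most_common()` does on Python 3.7+ and which happens to match the table above.

```python
from collections import Counter

# The 20-byte example file, in order.
data = [39, 44, 39, 11, 18, 18, 11, 78, 39, 11,
        44, 39, 18, 11, 11, 11, 74, 44, 78, 39]

counts = Counter(data)
# most_common() sorts by count, highest first; equal counts keep first-seen order.
ranked = [value for value, _ in counts.most_common()]

for rank, value in enumerate(ranked, start=1):
    print(rank, value, counts[value])
```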

now look at this, the pass for the first value ( 11 ), where a + marks a position that counts toward the offset :
01 39 +
02 44 +
03 39 +
04 11 + ( offset is 04 - 00 = 4 )
05 18 +
06 18 +
07 11 + ( offset is 07 - 04 = 3 )
08 78 +
09 39 +
10 11 + ( offset is 10 - 07 = 3 )
11 44 +
12 39 +
13 18 +
14 11 + ( offset is 14 - 10 = 4 )
15 11 + ( offset is 15 - 14 = 1 )
16 11 + ( offset is 16 - 15 = 1 )
17 74
18 44
19 78
20 39

so the output file looks like :
4
3
3
4
1
1
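
This first pass is just the distance between successive occurrences. A minimal sketch, with positions counted from 1 as in the listing and `plain_gaps` being a name made up for this illustration:

```python
data = [39, 44, 39, 11, 18, 18, 11, 78, 39, 11,
        44, 39, 18, 11, 11, 11, 74, 44, 78, 39]

def plain_gaps(data, value):
    """Distances between successive occurrences of value (1-based, starting from 0)."""
    gaps, previous = [], 0
    for position, byte in enumerate(data, start=1):
        if byte == value:
            gaps.append(position - previous)
            previous = position
    return gaps

print(plain_gaps(data, 11))  # [4, 3, 3, 4, 1, 1]
```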

next value ( 39 ) :
01 39 +1
02 44 +
03 39 +2
04 11 . (here's a flag )
05 18 +
06 18 +
07 11 .
08 78 +
09 39 +4 ( 09 - 03 = 6, -1 for the flag at 04, -1 for the flag at 07 )
10 11 .
11 44 +
12 39 +3-1 = 2
13 18 +
14 11 .
15 11 .
16 11 .
17 74 +
18 44 +
19 78 +
20 39 +8-3 = 5

adding to the output file :
1
2
4
2
5
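
The flag rule can be sketched like this: a pass only counts positions that no earlier pass has flagged, and it flags every position it emits an offset for. The name `flagged_pass` and the boolean `flags` list are assumptions of this sketch, not something fixed by the method.

```python
data = [39, 44, 39, 11, 18, 18, 11, 78, 39, 11,
        44, 39, 18, 11, 11, 11, 74, 44, 78, 39]

def flagged_pass(data, flags, value):
    """One pass for `value`: count only unflagged positions, then flag the hits."""
    offsets, gap = [], 0
    for i, byte in enumerate(data):
        if flags[i]:
            continue          # emitted in an earlier pass: does not add to the offset
        gap += 1
        if byte == value:
            offsets.append(gap)
            flags[i] = True   # later passes will skip this position
            gap = 0
    return offsets

flags = [False] * len(data)
first = flagged_pass(data, flags, 11)
second = flagged_pass(data, flags, 39)
print(first)   # [4, 3, 3, 4, 1, 1]
print(second)  # [1, 2, 4, 2, 5]
```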

next value ( 44 ) :
01 39 .
02 44 1
03 39 .
04 11 .
05 18 +
06 18 +
07 11 .
08 78 +
09 39 .
10 11 .
11 44 4
12 39 .
13 18 +
14 11 .
15 11 .
16 11 .
17 74 +
18 44 3
19 78
20 39

adding to the output file :
1
4
3

next value ( 18 ) :
01 39 .
02 44 .
03 39 .
04 11 .
05 18 1
06 18 1
07 11 .
08 78 +
09 39 .
10 11 .
11 44 .
12 39 .
13 18 2
14 11
15 11
16 11
17 74
18 44
19 78
20 39

adding to the output file :
1
1
2

etc
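
Putting the passes together, here is a hedged sketch of a complete encoder with a matching decoder. One assumption goes beyond the description: the table stores the count of each value next to the value, because the decoder needs some way to know where one value's run of offsets ends and the next begins. The 255 escape is left out to keep it short, and `encode` and `decode` are names made up for this sketch.

```python
from collections import Counter

def encode(data):
    """Return (table, offsets); table lists (value, count) by falling frequency."""
    table = Counter(data).most_common()
    flags = [False] * len(data)
    offsets = []
    for value, _ in table:
        gap = 0
        for i, byte in enumerate(data):
            if flags[i]:
                continue      # already emitted by an earlier pass
            gap += 1
            if byte == value:
                offsets.append(gap)
                flags[i] = True
                gap = 0
    return table, offsets

def decode(table, offsets):
    """Rebuild the data by replaying each pass over the still-empty slots."""
    total = sum(count for _, count in table)
    out = [None] * total
    stream = iter(offsets)
    for value, count in table:
        i = -1                # index of the last slot examined in this pass
        for _ in range(count):
            steps = next(stream)
            while steps:
                i += 1
                if out[i] is None:   # filled slots play the role of flags
                    steps -= 1
            out[i] = value
    return out

data = [39, 44, 39, 11, 18, 18, 11, 78, 39, 11,
        44, 39, 18, 11, 11, 11, 74, 44, 78, 39]
table, offsets = encode(data)
print(offsets)
assert decode(table, offsets) == data
```

On the 20-byte example this gives the offsets listed above, followed by 1 2 for value 78 and 1 for value 74, and `decode` returns the original 20 values. There is exactly one offset per input byte, so before any 255 escapes the offset stream has as many entries as the input, plus the table.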

note that as you move on to the less common values, the offsets get smaller (and that's exactly what the program is for)
in a huge file you should end up with a lot more low values (like 001, 050, 030 etc) than big ones (220, 190 etc)
that's what I call a heterogeneous file :)

if my idea is correct, you will be able to WinZip the output file, giving a new zip file, which in turn can be made heterogeneous again, for even a 1% gain each time (but repeated 1,000 times, or 1,000,000 times if you want to fit a 1 GB file onto a 720 KB floppy, lol)
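
For the arithmetic only: taking the 1% gain per pass as given, the number of passes needed to get from one size to another follows from a logarithm. The sketch assumes 1 GB = 10^9 bytes and 720 KB = 720 * 1024 bytes; `passes_needed` is a made-up name.

```python
import math

def passes_needed(source_bytes, target_bytes, gain_per_pass):
    """Smallest n with source_bytes * (1 - gain_per_pass)**n <= target_bytes."""
    ratio = source_bytes / target_bytes
    return math.ceil(math.log(ratio) / -math.log(1 - gain_per_pass))

n = passes_needed(10**9, 720 * 1024, 0.01)
print(n)  # 718
```

So under that premise the floppy case would take about 700 passes rather than a million.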

so, where am I wrong?
 
