Tuesday, April 28, 2015

Splitting of Large Text Files

Opening large log files from a production environment can be difficult (or outright impossible, depending on your system resources) with editors like Notepad or vi.  There are plenty of free apps out there that split text files, but you can't install them if you don't have admin privileges (which was my case on a Windows 7 laptop).

Luckily, I did have cygwin installed, so I could run some unix commands to help with my dilemma.  I found some great articles online, but ultimately the following is the command I ended up using, and it helped me a great deal.

awk '{outfile=sprintf("HUGEFILE.log.2015-04-28-%02d.txt",NR/2000000+1);print > outfile}' HUGEFILE.log.2015-04-28

HUGEFILE.log.2015-04-28 appears twice and needs to be replaced both times.  The first occurrence is the base name for the output file(s), to which a two-digit number is appended so that multiple output files can be told apart.  The second is the name of the file you are splitting.

NR is an awk built-in variable holding the current record (line) number.  For example: if the HUGEFILE.log.2015-04-28 file had 162 lines in it, and we replaced 2000000 with 50, we would get four output files.  Because sprintf's %02d truncates the result of the division, the first file would contain lines 1-49 (49 lines), the next two 50 lines each, and the last one the remaining 13 lines (150-162).
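You can check this behavior on a small sample.  The sketch below uses an illustrative sample.log in place of the real log file, with 50 substituted for 2000000:

```shell
# Generate a 162-line sample file, then split it with the same awk
# one-liner described above (50 lines per chunk instead of 2000000).
seq 162 > sample.log
awk '{outfile=sprintf("sample.log.%02d.txt", NR/50+1); print > outfile}' sample.log
# Inspect the line counts of the resulting pieces.
wc -l sample.log.??.txt
```

Running wc -l on the pieces shows how the truncated division distributes the lines across the four output files.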

The 2000000 is the number of lines that I determined my editor (cygwin vi) could open in a relatively reasonable amount of time.  For my logs, 2,000,000 lines worked out to a file of approximately 500 MB.
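If your log lines are longer or shorter than mine, you can estimate a line count for a ~500 MB target from the file's average bytes per line.  A rough sketch (sample.log is a stand-in for the real log file):

```shell
# Build a stand-in log file; in practice you'd use the real one.
seq -f 'log line %.0f' 1000 > sample.log

# Average bytes per line = total bytes / total lines, so the number of
# lines that fit in ~500 MB is: lines * target_bytes / total_bytes.
bytes=$(wc -c < sample.log)
lines=$(wc -l < sample.log)
target=$((500 * 1024 * 1024))
echo $(( lines * target / bytes ))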

So after running the above command on my 4.2GB original file, I got 9 output files of ~500MB each, except the last file, which was 140MB.  It's a lot easier to open those files to peruse the logs!
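For the record, cygwin also ships the coreutils split command, which can do the same line-based split without awk.  A sketch with illustrative filenames (the --additional-suffix option assumes a reasonably recent GNU coreutils):

```shell
# Build a 162-line sample file and split it into 50-line pieces.
# -l 50: 50 lines per piece; -d: numeric suffixes (00, 01, ...);
# --additional-suffix=.txt: append .txt to each piece name.
seq 162 > sample.log
split -l 50 -d --additional-suffix=.txt sample.log piece.
wc -l piece.*.txt
```

This produces piece.00.txt through piece.03.txt; the first three hold 50 lines each and the last holds the remaining 12.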
