Randomizing Text File StreamReader/Writer

guest2013-1

guest
Joined
Aug 22, 2003
Messages
19,800
Line count isn't the problem; however, as stated before, to determine the line count I have to run through the file once first. I'm already doing that to set up the progress bar feedback I have going with the background worker, so I can just reuse that number when I actually split the files.

The 3.1 gig file takes about 40 minutes to process, which includes re-ordering the initial data columns, stripping any invalid HTML characters, delimiting it correctly (into CSV) and splitting it into 60k-odd files of 55 lines each.

The use of this isn't something I can discuss here right now, but it's specific to the company and its users, so it's necessary to have in there. Reading out a bunch of lines, randomizing only those lines and then combining them afterwards won't be possible, because with a dataset as big as the 3.1 gig file I'm sure to run into several thousand lines starting with "A".

I like the bin idea, and I'm contemplating implementing it while I process the regex filter on the data itself. So while the file is iterated through and the columns are processed, each processed line would be pushed into one of a randomly ordered set of bin files, which would then be combined into one file after everything is processed. This could give me a more random feel than what I currently have, without incurring more resources or making the process slower.
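
A minimal sketch of that bin approach, in case it helps picture it. The bin count, paths and the processLine delegate are placeholders for whatever the real pipeline does, not anything from this thread:

using System;
using System.IO;
using System.Linq;

static class BinShuffler
{
    // Streams the input line by line, drops each processed line into a random
    // temporary bin file, then stitches the bins together in a shuffled order.
    // Nothing is held in memory, so RAM stays flat regardless of input size.
    public static void ShuffleIntoBins(string inputPath, string outputPath,
                                       string tempDir, int binCount,
                                       Func<string, string> processLine)
    {
        var random = new Random();
        var bins = new StreamWriter[binCount];
        try
        {
            for (int i = 0; i < binCount; i++)
                bins[i] = new StreamWriter(Path.Combine(tempDir, "bin" + i + ".tmp"));

            using (var reader = new StreamReader(inputPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    bins[random.Next(binCount)].WriteLine(processLine(line));
            }
        }
        finally
        {
            foreach (var writer in bins)
                if (writer != null) writer.Dispose();
        }

        // Concatenate the bins in a shuffled order for an extra layer of randomness.
        using (var output = new StreamWriter(outputPath))
        {
            foreach (int i in Enumerable.Range(0, binCount).OrderBy(n => random.Next()))
            {
                using (var binReader = new StreamReader(Path.Combine(tempDir, "bin" + i + ".tmp")))
                {
                    string binLine;
                    while ((binLine = binReader.ReadLine()) != null)
                        output.WriteLine(binLine);
                }
            }
        }
    }
}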

3.1 gig is the extreme though, by the way... usually the files are under 2 MB. But I'd rather optimize for a huge file now than later go "oh crap, the **** is slow because I never knew it would be that big". So far the only time disk IO is impacted is during the initial read of the file and when writing out thousands of tiny split-up files. That's only for a couple of minutes though, and the most RAM I use is 80 MB, which drops down to 7-15 MB when finished.
 

FarligOpptreden

Executive Member
Joined
Mar 5, 2007
Messages
5,396
Just a question - is this supposed to act as some kind of flat-file database or what? If so, is it not possible to convince them that an RDBMS would be a better option?

At this point I'm guessing the answer to my question is no, because you would've thought of and tried that already, knowing you... :p
 

guest2013-1

guest
Joined
Aug 22, 2003
Messages
19,800
Just a question - is this supposed to act as some kind of flat-file database or what? If so, is it not possible to convince them that an RDBMS would be a better option?

At this point I'm guessing the answer to my question is no, because you would've thought of and tried that already, knowing you... :p

;) I would've fought them to the DEATH if the issue could be solved with a simple database. Nope, not using it as a flat-file database. I'm using it for different types of SEO/SEM strategies the client wants to implement: I get a bunch of text files and from there I need to do my shiat. Obviously, working with a 3 gig file would be insane; splitting it up would be "okay", but then several things in this "strategy" they want me to implement would conflict with each other. Plus, the data they give me is real crap to begin with, so I need to make do.

Hands up, who has ever received data from a client (or two) who expects you to perform miracles with it and for stuff to "just work"?
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
Line count isn't the problem; however, as stated before, to determine the line count I have to run through the file once first. I'm already doing that to set up the progress bar feedback I have going with the background worker, so I can just reuse that number when I actually split the files.

Another way of showing progress back to the user would be to use a "bytes processed" type of idea. You know the initial file size, and based on the encoding of your input file (ASCII, UTF-16, etc.) you can calculate your % processed from how much data you've parsed. That way you don't have to parse the entire file for an initial line count just to calculate your progress.
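
A rough sketch of that byte-based progress idea, assuming a single-byte encoding (ASCII/ANSI); the method and parameter names are illustrative, and the BackgroundWorker hook stands in for whatever your existing worker already reports to:

using System;
using System.ComponentModel;
using System.IO;

static class ByteProgress
{
    // Reports 0-100% based on bytes read so far instead of a line count taken
    // up front. For UTF-8 with multi-byte characters the estimate drifts a
    // little, but it stays monotonic and never needs a second pass.
    public static void ProcessWithProgress(string inputPath, BackgroundWorker worker,
                                           Action<string> handleLine)
    {
        long totalBytes = new FileInfo(inputPath).Length;
        long bytesProcessed = 0;
        int lastPercent = -1;

        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                handleLine(line);

                // +2 covers the CR/LF that ReadLine strips (use +1 for LF-only files).
                bytesProcessed += line.Length + 2;

                int percent = (int)(bytesProcessed * 100 / totalBytes);
                if (percent != lastPercent)   // only report when the value changes
                {
                    worker.ReportProgress(percent);
                    lastPercent = percent;
                }
            }
        }
    }
}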

The 3.1 gig file takes about 40 minutes to process, which includes re-ordering the initial data columns, stripping any invalid HTML characters, delimiting it correctly (into CSV) and splitting it into 60k-odd files of 55 lines each.

I suggest you take a look at these free libraries; they may save you a lot of work:
FileHelpers v 2.0
Html Agility Pack

3.1 gig is the extreme though, by the way... usually the files are under 2 MB. But I'd rather optimize for a huge file now than later go "oh crap, the **** is slow because I never knew it would be that big". So far the only time disk IO is impacted is during the initial read of the file and when writing out thousands of tiny split-up files. That's only for a couple of minutes though, and the most RAM I use is 80 MB, which drops down to 7-15 MB when finished.

Which .NET platform are you targeting? This may fit in quite well with the Task Parallel Library.
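
A hedged sketch of what that could look like, assuming the split output files can be processed independently; the Task Parallel Library ships with .NET 4.0, and the directory path and per-file work below are placeholders:

using System;
using System.IO;
using System.Threading.Tasks;

static class ParallelSplitWork
{
    // Runs the per-file clean-up over the already-split output files in parallel.
    static void ProcessSplits(string splitDir)
    {
        Parallel.ForEach(Directory.GetFiles(splitDir, "*.csv"), path =>
        {
            // placeholder for the real work: regex clean-up, column re-ordering, etc.
            Console.WriteLine("processing " + path);
        });
    }
}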
 

guest2013-1

guest
Joined
Aug 22, 2003
Messages
19,800
.NET 3.5

I already have a nice CSV "file helper" that has excellent performance.

Thanks for the idea. I think I could do bytes processed as opposed to line count, but how do you get how many bytes have been processed?
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
Thanks for the idea. I think I could do bytes processed as opposed to line count, but how do you get how many bytes have been processed?

You could estimate it using the length of each line read from your StreamReader: keep a running total of each line's length, plus a byte or two for the line break that ReadLine strips off. If your input only contains ASCII characters that estimate will be accurate, since ASCII is one byte per character.

You could get the exact value using the Encoding class in System.Text (which will cater for Unicode encodings), but that seems like unnecessary overhead.

The FileInfo class exposes the file size through its Length property.
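
To illustrate the difference between the two approaches (the path, encoding and line contents below are placeholders, not from this thread):

using System;
using System.IO;
using System.Text;

static class ByteCounting
{
    static void Example()
    {
        // FileInfo.Length gives the total file size in bytes.
        long totalBytes = new FileInfo("input.txt").Length;

        string line = "some line read from the StreamReader";
        int estimated = line.Length + 2;                      // cheap: assumes 1 byte/char plus CRLF
        int exact = Encoding.UTF8.GetByteCount(line) + 2;     // exact for the chosen encoding

        Console.WriteLine("{0} (estimated) vs {1} (exact) of {2} total bytes",
                          estimated, exact, totalBytes);
    }
}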
 