guest2013-1
Line count isn't the problem, though. As stated earlier, I already have to run through the file once beforehand to get the line count, since I use it to set up the progress bar feedback from the background worker, so I can just reuse that number when I actually split the files.
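For reference, a minimal sketch of what that pre-pass could look like in C# (assuming .NET, since a BackgroundWorker is mentioned; the file name and variable names are just illustrative):

```csharp
// Pre-pass sketch: stream through the file once to count lines, so the same
// number can drive both the progress bar and the later split.
using System;
using System.IO;

class LineCountPrePass
{
    static long CountLines(string path)
    {
        long count = 0;
        using (var reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
                count++;
        }
        return count;
    }

    static void Main()
    {
        long totalLines = CountLines("input.dat");   // hypothetical input file
        Console.WriteLine($"Total lines: {totalLines}");
        // Later: progressPercent = linesProcessed * 100 / totalLines
    }
}
```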
The 3.1 GB file takes about 40 minutes to process, which includes re-ordering the initial data columns, stripping any invalid HTML characters, delimiting it correctly (into CSV), and splitting it into 60k-odd files of 55 lines each.
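A rough sketch of that processing/splitting pass is below; the 55-line chunk size matches the description above, but the input delimiter, the column order, the "invalid character" regex, and the file names are all placeholder assumptions:

```csharp
// Sketch: clean each line, re-order columns into CSV, write out in 55-line chunks.
using System;
using System.IO;
using System.Text.RegularExpressions;

class SplitIntoChunks
{
    const int LinesPerFile = 55;
    static readonly Regex Invalid = new Regex("[<>&]");   // placeholder "invalid HTML chars" filter

    static void Main()
    {
        int fileIndex = 0, lineInFile = 0;
        StreamWriter writer = null;

        foreach (var raw in File.ReadLines("input.dat"))           // hypothetical input
        {
            var cols = Invalid.Replace(raw, "").Split('\t');        // delimiter assumed
            var csv = string.Join(",", cols[2], cols[0], cols[1]);  // column re-order assumed; expects >= 3 columns

            if (writer == null)
                writer = new StreamWriter($"out_{fileIndex:D5}.csv");

            writer.WriteLine(csv);
            if (++lineInFile == LinesPerFile)
            {
                writer.Dispose();
                writer = null;
                lineInFile = 0;
                fileIndex++;
            }
        }
        writer?.Dispose();
    }
}
```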
I can't discuss what this is used for right now, but it's specific to the company and its users, so it needs to be in there. Reading in a batch of lines, randomizing only that batch, and then combining the batches afterwards won't work, because with a dataset as big as the 3.1 GB file I'm sure to run into several thousand consecutive lines starting with "A".
I like the bin idea, and I'm contemplating implementing it while I run the regex filter on the data itself. As the file is iterated and the columns are processed, each processed line would be pushed into one of a set of bin files in randomized order, and the bins would be combined into one file after everything is processed (roughly like the sketch below). That could give me a more random result than what I currently have without using more resources or making the process slower.
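A minimal sketch of that bin approach in C# (the bin count, file names, and random source are assumptions): each processed line is appended to a randomly chosen temp bin during the single pass, and afterwards the bins are concatenated in random order, each one shuffled in memory since individual bins stay small.

```csharp
using System;
using System.IO;
using System.Linq;

class BinShuffle
{
    const int BinCount = 64;                     // assumed; tune for memory vs. randomness
    static readonly Random Rng = new Random();

    static void Main()
    {
        // Pass 1: scatter processed lines across the bin files.
        var bins = Enumerable.Range(0, BinCount)
                             .Select(i => new StreamWriter($"bin_{i:D2}.tmp"))
                             .ToArray();
        foreach (var line in File.ReadLines("processed.csv"))   // hypothetical processed output
            bins[Rng.Next(BinCount)].WriteLine(line);
        foreach (var w in bins) w.Dispose();

        // Pass 2: shuffle each (small) bin in memory and append to the final file.
        using (var output = new StreamWriter("shuffled.csv"))
        {
            foreach (var bin in Enumerable.Range(0, BinCount)
                                          .OrderBy(_ => Rng.Next()))   // visit bins in random order
            {
                var lines = File.ReadAllLines($"bin_{bin:D2}.tmp")
                                .OrderBy(_ => Rng.Next())              // in-memory shuffle of one bin
                                .ToArray();
                foreach (var line in lines) output.WriteLine(line);
                File.Delete($"bin_{bin:D2}.tmp");
            }
        }
    }
}
```

The appeal of this layout is that it piggybacks on the single pass already being done for the regex filter, so the extra cost is only the second, much cheaper pass over the small bin files.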
3.1 GB is the extreme case, by the way; usually the files are under 2 MB. But I'd rather optimize for a huge file now than find out later down the line that it's slow because I never expected files that big. So far the only times disk I/O is impacted are the initial read of the file and when writing out the thousands of tiny split files, and that's only for a couple of minutes. The most RAM I use is about 80 MB, which drops to 7-15 MB when finished.