Randomizing a Text File with StreamReader/Writer

FarligOpptreden

Executive Member
Joined
Mar 5, 2007
Messages
5,396
OH OH OH! This is gonna be good... Deq getting on his war-horse... :D

ROUND 1: Fight!

So reading in the entire file is not an option.

Do you need the whole file randomized and the pieces, or could you just output the pieces? Max size of each piece?

That was also my knee-jerk reaction to the 2GB sized file. What about randomizing say 500 lines at a time and writing it into the new file and, if performance permits, do a second (or even third) pass on the new file, but increase the starting position by about 100 - 200 to randomize the "chunked" output?
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
OH OH OH! This is gonna be good... Deq getting on his war-horse... :D
lol, I've become this guy :D

That was also my knee-jerk reaction to the 2GB sized file. What about randomizing say 500 lines at a time and writing it into the new file and, if performance permits, do a second (or even third) pass on the new file, but increase the starting position by about 100 - 200 to randomize the "chunked" output?

That's exactly the direction I was going with that comment...
 

guest2013-1

guest
Joined
Aug 22, 2003
Messages
19,800
Writing this in .NET btw; I figured StreamReader/Writer is mostly used in .NET - haven't seen it in any other language.

I don't have the option of outputting them into smaller pieces first and then randomizing their order. For example, I have a datafeed containing books that's about 3.1 GB. If I cut it into smaller pieces before randomizing the lines, I'd just end up with randomized lines for the "A"s (*maybe* a "B", depending on where I cut it off), while many, many, MANY other pieces would still only randomize within the "A"s.

The other issue is when you have more than one datafeed from more than one supplier. So now you have datafeed 1... A B C D E F, and datafeed 2 starting right after that: A B C D E F. Same problem - except, if both suppliers supply books (different kinds, for example), it's going to have that linear feel to it which I'm trying to avoid before processing these feeds into the database.

Don't ask me WHY they want it like this, but apparently their current system processes the files as-is. So if I can get the lines random, perfect.
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
Writing this in .NET btw; I figured StreamReader/Writer is mostly used in .NET - haven't seen it in any other language.

I've seen similar things in Java & C++...

I don't have the option of outputting them into smaller pieces first and then randomizing their order. For example, I have a datafeed containing books that's about 3.1 GB. If I cut it into smaller pieces before randomizing the lines, I'd just end up with randomized lines for the "A"s (*maybe* a "B", depending on where I cut it off), while many, many, MANY other pieces would still only randomize within the "A"s.

The other issue is when you have more than one datafeed from more than one supplier. So now you have datafeed 1... A B C D E F, and datafeed 2 starting right after that: A B C D E F. Same problem - except, if both suppliers supply books (different kinds, for example), it's going to have that linear feel to it which I'm trying to avoid before processing these feeds into the database.

Don't ask me WHY they want it like this, but apparently their current system processes the files as-is. So if I can get the lines random, perfect.

This is a very interesting problem, :D

One possible way to handle this that comes to mind is to create bins. Since it's alphabetical, I'd say 26 bins (to hopefully get some sort of distribution).

What you could then do is read a line from the data feed, select a bin at random and insert that line into that bin. Hopefully you'd see something like this after you've read the entire feed:
Code:
Bin 1  Bin 2  .... Bin 26
A       A            A 
A       A            A
B       B            B
B       B            B
C       C            C      
C       C            C
etc    etc           etc

Then randomize each bin, then insert each bin into the database, picking them at random.
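For what it's worth, a rough sketch of that bin idea (in Python for brevity - the thread's context is .NET, but the logic is the same; the temp file names and bin count are just illustrative):

```python
import random

NUM_BINS = 26  # illustrative; any count that keeps each bin small enough for memory

def shuffle_via_bins(in_path, out_path):
    # Pass 1: stream the feed line by line, appending each line to a
    # randomly chosen bin file so no bin is dominated by one letter.
    bins = [open("bin_%02d.tmp" % i, "w") for i in range(NUM_BINS)]
    with open(in_path) as feed:
        for line in feed:
            random.choice(bins).write(line)
    for b in bins:
        b.close()

    # Pass 2: each bin is now a fraction of the feed, so it can be
    # shuffled in memory; write the shuffled bins out in random order.
    order = list(range(NUM_BINS))
    random.shuffle(order)
    with open(out_path, "w") as out:
        for i in order:
            with open("bin_%02d.tmp" % i) as b:
                lines = b.readlines()
            random.shuffle(lines)
            out.writelines(lines)
```

Note that writing whole bins out one after another still leaves some bin-level grouping; interleaving lines drawn from random bins would mix things further.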

Another option might be to just read the whole damn feed into the database and use the database to randomize the data.

Where's sn3rd? He's good at this type of thing...
 

guest2013-1

guest
Joined
Aug 22, 2003
Messages
19,800
Thanks for the ideas; and he's probably on vacation. When you work for yourself it's hard to keep track of when public holidays are and ****
 

sn3rd

Expert Member
Joined
Jan 18, 2008
Messages
4,305
I've seen similar things in Java & C++...
Where's sn3rd he's good at this type of thing...?

lol

Been reading, but seems more like a .NET thing than anything else; maybe I'm way off base... Permuting lines "randomly" is easy enough.

My knowledge of the .NET library is getting worse by the day.

What is more important in this situation: computational resources, I/O, or memory? That will determine the best way to do it.
 

greggpb

Expert Member
Joined
Apr 22, 2005
Messages
1,818
I've seen similar things in Java & C++...



This is a very interesting problem, :D

One possible way to handle this that comes to mind is to create bins. Since it's alphabetical, I'd say 26 bins (to hopefully get some sort of distribution).

What you could then do is read a line from the data feed, select a bin at random and insert that line into that bin. Hopefully you'd see something like this after you've read the entire feed:
Code:
Bin 1  Bin 2  .... Bin 26
A       A            A 
A       A            A
B       B            B
B       B            B
C       C            C      
C       C            C
etc    etc           etc

Then randomize each bin, then insert each bin into the database, picking them at random.

Another option might be to just read the whole damn feed into the database and use the database to randomize the data.

Where's sn3rd? He's good at this type of thing...

I really like this idea. I would create temp files/bins of 1,000 - 10,000 lines, depending on which size better covers the file open/close/create overhead, then randomize those, then randomly concatenate the files. It means you need the disk space but not the memory, and you can add a setting for the amount of memory you want the program to use to increase speed (increase the bin sizes).

It's very important to make sure your algorithms for concatenation, reading and writing are optimized, as they will be the bottleneck.

The problem is if you have duplicate lines and they are greater in number than the bin size and you sort alphabetically... you might end up with one bin containing all the same type of line.
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
Been reading, but seems more like a .NET thing than anything else; maybe I'm way off base... Permuting lines "randomly" is easy enough.

The problem is he can't read in all the lines at once and then randomize them (which is easy), basically the only .NET limitation here is that the largest addressable memory "object" is around 2GB.
 

sn3rd

Expert Member
Joined
Jan 18, 2008
Messages
4,305
The problem is he can't read in all the lines at once and then randomize them (which is easy), basically the only .NET limitation here is that the largest addressable memory "object" is around 2GB.

How about generating a random/pseudorandom number from 0 to numLines - 1 and pulling the corresponding line when it hits the jackpot?
 

greggpb

Expert Member
Joined
Apr 22, 2005
Messages
1,818
Or you could imitate paging; the result files could be
00000-10000.txt
10000-20000.txt
but wouldn't this result in a search/scan of the original file per line?
Unless, when you write the target file, you append the line number to the front of each line so you can place the result in the right place, and then delete the first x characters of each line afterwards... still one scan per line, but at least the result files might be smaller for the scans.
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
How about generating a random/pseudorandom number from 0 to numLines - 1 and pulling the corresponding line when it hits the jackpot?

The problem is the number of lines is unknown until you parse the entire file ...?
 

FarligOpptreden

Executive Member
Joined
Mar 5, 2007
Messages
5,396
I still don't fully understand why the contents have to be random. Can't the receiving system parse the file if it's not random, or what? :confused:
 

sn3rd

Expert Member
Joined
Jan 18, 2008
Messages
4,305
The problem is the number of lines is unknown until you parse the entire file ...?

Why can't you make a single pass through the file to find that out? It surely shouldn't be that intensive?

And why can't you know that before parsing it? .NET is lame :p
 

FarligOpptreden

Executive Member
Joined
Mar 5, 2007
Messages
5,396
I've been reading up, and a viable alternative to iterating through each line of a large text file seems to be doing a Regex match for line breaks... I think that might actually work, but I don't know how fast it would be.
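For what it's worth, the chunked line-break count could look something like this (a Python sketch, not .NET; the chunk size is arbitrary):

```python
import re

NEWLINE = re.compile(rb"\n")  # a Windows "\r\n" still contains exactly one "\n"

def count_lines(path, chunk_size=1 << 20):
    # Read the file in fixed-size binary chunks and count newline matches,
    # so memory use stays constant no matter how big the file is.
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(NEWLINE.findall(chunk))
    return total
```

One caveat: a final line with no trailing newline wouldn't be counted by this.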
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
Why can't you make a single pass through the file to find that out? It surely shouldn't be that intensive?

And why can't you know that before parsing it? .NET is lame :p

Of course you could - but 3.1GB of text, eek!

I've been reading up, and a viable alternative to iterating through each line of a large text file seems to be doing a Regex match for line breaks... I think that might actually work, but I don't know how fast it would be.

I also did a bit of research, saw a couple of guys suggesting regex...
 

HavocXphere

Honorary Master
Joined
Oct 19, 2007
Messages
33,155
Borrowing heavily from Dequadin:

So we run through the input file with a while loop in linear fashion; then we never need to know the line count, so there's no first run-through, and we don't need to keep track of what has already been used.

Set up a bunch of dequadin's bins, with the number of bins equal to the number of pieces wanted here:
Acid said:
Once it's randomized I write it to a new file and then split it up into pieces

For each item, append it to bins[randomBinNum]. Then load each of the files into memory, run a Fisher-Yates shuffle on the now reasonably sized files, and overwrite the unshuffled files.
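For reference, a minimal Fisher-Yates shuffle looks like this (sketched in Python rather than .NET):

```python
import random

def fisher_yates(items):
    # Walk backwards from the last element, swapping each position with
    # a uniformly chosen position at or before it: O(n) and unbiased.
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)  # inclusive bounds
        items[i], items[j] = items[j], items[i]
    return items
```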
 

Keeper

Honorary Master
Joined
Mar 29, 2008
Messages
23,624
I have no idea of how .Net works, so i have a few questions.


If it takes that long to calculate how many lines the file has:

- Can you load any given line of a txt file into a variable (or at least check if it exists)?

- If you can (which I'm 99.9999% sure you could) and it is fast, then why not code a loop to check if line 1,000 is there... then 2,000, then 3,000, and so on? If you know there are more than 54,000 lines but fewer than 55,000, you can narrow it down into 100s (check 54,100, then 54,200, etc.).

Surely it would only take a couple of iterations before you can determine how many lines there are?

You could also check if line 100,000 exists - if it does not, halve it and check line 50,000 - if that does exist, check 75,000 - if 75k does not exist, check halfway between 50k and 75k... and so on (you basically halve the "check value" each time: upper half if found, lower half if not found).

So basically, all that's needed to know is: how long does it take to load, say, line 50,000 of a file? Does it need to process the WHOLE file?
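To sketch what that probing would look like (Python, not .NET; note the catch: there is no way to jump straight to line n of a plain text file, so every probe reads from the start):

```python
def line_exists(path, n):
    # To check whether line n exists we have to read (and discard)
    # the first n lines - there is no direct seek-to-line for text.
    with open(path) as f:
        for i, _ in enumerate(f, start=1):
            if i >= n:
                return True
    return False

def count_by_probing(path):
    # Exponential search for an upper bound, then binary search for the
    # exact line count. Each probe re-reads the file from the start, so
    # this does more total I/O than a single linear counting pass.
    hi = 1
    while line_exists(path, hi):
        hi *= 2
    lo = hi // 2  # last successful probe (or 0 for an empty file)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if line_exists(path, mid):
            lo = mid
        else:
            hi = mid
    return lo
```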
 

dequadin

Expert Member
Joined
May 9, 2008
Messages
1,434
I have no idea of how .Net works, so i have a few questions.

<snip>
....
</snip>
Does it need to process the WHOLE file?

Text is text, this is not a .NET problem, you'll have the same limitations in C++, Java, anything.

Herein lies the problem. To calculate the number of lines in a text file you need to count the number of newline characters. Whether it's '\r\n' under Windows or '\n' under *nix, you need to count them, and you have no idea where they are in the file.

Example:
Code:
This:
I am line one.
I am line two.

Is actually stored as:
I am line one.\r\nI am line two.

Even if the language you are using has a built-in function to tell you how many lines are in a text file, under the hood it's parsing the entire file and counting newline characters.

So yes you'll always need to read the entire file - regardless of platform/language.
 

greggpb

Expert Member
Joined
Apr 22, 2005
Messages
1,818
Text is text, this is not a .NET problem, you'll have the same limitations in C++, Java, anything.

Herein lies the problem. To calculate the number of lines in a text file you need to count the number of newline characters. Whether it's '\r\n' under Windows or '\n' under *nix, you need to count them, and you have no idea where they are in the file.

Example:
Code:
This:
I am line one.
I am line two.

Is actually stored as:
I am line one.\r\nI am line two.

Even if the language you are using has a built-in function to tell you how many lines are in a text file, under the hood it's parsing the entire file and counting newline characters.

So yes you'll always need to read the entire file - regardless of platform/language.

The question is how many times you're going to have to parse the entire file, and how much of the file is going to be in the buffer (memory) at a time. How random must the output file be? That is super important: if the random part is not super important, you could take a figure of 8 (could be any number), read the source line by line and append each line randomly to one of the 8 target files, then concatenate the files afterwards.
 