Programming/Logic quandary

Solarion · Apr 10, 2019

I think I need someone with a more Einstein way of thinking here.

Problem: I have a large text file with several thousand rows. Management wants this text file split into smaller text files of 500 rows each. The last file might only have 50 or 80 or even 499 rows, depending on what is left over in the main file.

This is the basic block of logic I have so far.

C#:

public static void ReadFile(string file)
{
    int y = 0;

    var lines = File.ReadAllLines(file);

    foreach (var line in lines)
    {
        y++;

        if (y == 500)
        {
            y = 0; // each time this reaches 500, write to a new file and start the counter again
        }
    }
}

I'll bet you can see the problem already. What happens when I only have, say, 100 rows or even 499 rows left? How is my logic going to handle that remainder of rows in the file?

Perhaps I'm going about this all wrong in splitting up a text file. At this point I'm dead in the water.

Solarion · Apr 10, 2019

Think I've figured it out. Will post the new code block a bit later on. It's working now anyway just needed a smoke and some coffee

Johnatan56 · Apr 10, 2019

Can't you get the number of lines and split it line by line?
While there are lines left, keep going: take the line count mod 500 and use that to decide when you need a new file. Or count in reverse: keep writing lines until you hit 0, and whenever the count is divisible by 500 with no remainder, start a new file.

This seems like a second year question at most since it looks like you can read the entire file into memory.
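The arithmetic behind that suggestion can be sketched roughly as below. This is only an illustration, not code from anyone in the thread: the class name `ChunkMath` and both method names are made up for the example.

```csharp
using System;

// Hypothetical helper, named for illustration only.
public static class ChunkMath
{
    // Number of output files needed: totalRows / rowsPerFile, rounded up.
    public static int FileCount(int totalRows, int rowsPerFile)
    {
        return (totalRows + rowsPerFile - 1) / rowsPerFile;
    }

    // Rows in the last file: the remainder, or a full file when it divides evenly.
    public static int LastFileRows(int totalRows, int rowsPerFile)
    {
        if (totalRows == 0) return 0;
        int remainder = totalRows % rowsPerFile;
        return remainder == 0 ? rowsPerFile : remainder;
    }
}
```

For a 1080-row file split by 500, `FileCount` gives 3 and `LastFileRows` gives 80, which is exactly the "remainder" case the original question was worried about.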

bchip · Apr 10, 2019

Solarion said:
I think I need someone with a more Einstein way of thinking here.

My favourite Einstein thinking

Johnatan56 · Apr 10, 2019

bchip said:
My favourite Einstein thinking

Yeah, think @Solarion needs a break, should show this to his boss to say overworked.

Veroland · Apr 10, 2019

Check my comments in your code

Code:

public static void ReadFile(string file)
{
    int y = 0;

    var lines = File.ReadAllLines(file);

    foreach (var line in lines)
    {
        y++;

        // Add line to a list

        if (y == 500)
        {
            // Write list to file
            // Set list to empty again
            y = 0; // each time this reaches 500, write to a new file and start the counter again
        }
    }

    if (y > 0)
    {
        // There are lines left over
        // Write list to file
    }
}
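For anyone landing here later, the comments above could be filled in along these lines. It is a minimal sketch, not Veroland's actual code: the `FileSplitter` class, the `Flush` helper, the `outputDir` parameter and the `part0001.txt` naming scheme are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical class; names and the output naming scheme are assumptions.
public static class FileSplitter
{
    // Splits inputPath into files of at most rowsPerFile lines each
    // and returns the paths that were written.
    public static List<string> Split(string inputPath, string outputDir, int rowsPerFile = 500)
    {
        var written = new List<string>();
        var chunk = new List<string>(rowsPerFile);

        foreach (var line in File.ReadAllLines(inputPath))
        {
            chunk.Add(line);

            if (chunk.Count == rowsPerFile)
            {
                // A full chunk: write it out and start over.
                written.Add(Flush(chunk, outputDir, written.Count + 1));
            }
        }

        // The check outside the loop: whatever is left is the short last file.
        if (chunk.Count > 0)
        {
            written.Add(Flush(chunk, outputDir, written.Count + 1));
        }

        return written;
    }

    // Writes the buffered lines to partNNNN.txt, empties the buffer, returns the path.
    private static string Flush(List<string> chunk, string outputDir, int fileNo)
    {
        string path = Path.Combine(outputDir, $"part{fileNo:D4}.txt");
        File.WriteAllLines(path, chunk);
        chunk.Clear();
        return path;
    }
}
```

Here the list's own `Count` stands in for the separate `y` counter, so there is one less variable to keep in sync. The output directory is assumed to exist already.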

Solarion · Apr 10, 2019

Thanks Veroland that looks pretty awesome I'll give that a try

That works beautifully

This is what I was ultimately missing. Think my brain is definitely fried for today.

C#:

if (y > 0)
{
    // There are lines left over
    // Write list to file
}

Veroland · Apr 10, 2019

Solarion said:
Thanks Veroland that looks pretty awesome I'll give that a try

That works beautifully

This is what I was ultimately missing. Think my brain is definitely fried for today.

Cool, most of the time simpler is better

Solarion · Apr 10, 2019

Veroland said:
Cool, most of the time simpler is better

Yeah, I was way overthinking it. I actually had a similar check inside the loop, and then I needed another variable, and and and... it became a bit much. The check outside of the loop was what did it.

gkm · Apr 11, 2019

Another option you could do without writing code is the Linux "split -l" command: https://www.computerhope.com/unix/usplit.htm

Easiest way to get it on Windows is probably by installing git bash - https://git-scm.com/downloads or any other interesting way to get access to bash commands.

Many languages also have a split command for arrays, but I am not sure if C# is one of them, without resorting to LINQ.
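A LINQ-free array split is short enough to write by hand. The sketch below is an assumption-laden illustration rather than a built-in: `ArrayChunks.Split` is a made-up helper name. (As far as I know, .NET 6 and later do ship `Enumerable.Chunk`, but that is LINQ again.)

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the .NET base class library.
public static class ArrayChunks
{
    // Yields consecutive slices of at most size elements; the last one may be shorter.
    // Because this is an iterator, the argument check runs when enumeration starts.
    public static IEnumerable<T[]> Split<T>(T[] source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        for (int start = 0; start < source.Length; start += size)
        {
            int count = Math.Min(size, source.Length - start);
            var slice = new T[count];
            Array.Copy(source, start, slice, 0, count);
            yield return slice;
        }
    }
}
```

With that in place, something like `foreach (var part in ArrayChunks.Split(File.ReadAllLines(file), 500))` hands you one array per output file, short last one included.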

Steamy Tom · Apr 11, 2019

Veroland said:
Cool, most of the time simpler is better

alwaaaaaaays

cguy · Apr 11, 2019

eye_suc said:
The number of lines was known up front, right?
Could also have tried a normal for loop (i.e. for (int i = 0; i < lines.Count(); i++)) and then used modulus (i % 500) to determine when to switch to a new file.

Glad you got sorted

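or ( (i & 0x1ff) == 0 ), and behold the massive speed increase. (The extra parentheses are needed because == binds tighter than & in C#, and the mask only works if management settles for 512 rows per file instead of 500.)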

Spacerat · Apr 11, 2019

The examples above are brute force and use O(n) space, i.e. they load the entire input file into memory. Not good. The code below reads each line from the input file and writes it straight to the appropriate output file, so it only needs O(1) space. Good.

C#:

    // needs: using System.IO;
    static void Main(string[] args)
    {
      int outputFileSize = 500;
      string inputFilename = @"Input.txt";
      string outputFilePattern = @"Output.{0}.txt";

      using (StreamReader reader = File.OpenText(inputFilename))
      {
        StreamWriter outputFile = null;
        int outputFileNo = 0;
        int lines = 0;

        try
        {
          string line;
          while ((line = reader.ReadLine()) != null)
          {
            if (lines % outputFileSize == 0) // Every 500th line starts a new output file
            {
              // Close previous file (Dispose also closes it)
              outputFile?.Dispose();

              // Open new file; CreateText overwrites leftovers from an earlier run,
              // where AppendText would have added to them
              outputFileNo++;
              string outputFilename = string.Format(outputFilePattern, outputFileNo);
              outputFile = File.CreateText(outputFilename);
            }

            // Write to file
            outputFile.WriteLine(line);
            lines++;
          }
        }
        finally
        {
          // Cleanup; this also closes the short last file
          outputFile?.Dispose();
        }
      }
    }

[)roi(] · Apr 11, 2019

It can also be done with Linq.

C#:

File.ReadLines(@"path/filename.extension")
    .Select((x, i) => (e: x, g: i / 500)) // create a tuple of (e: element, g: group id)
    .GroupBy(x => x.g, x => x.e) // group by group id, return only the element (x.e)
    .Select(x => x.ToList())
    .ToList() // ForEach is defined on List<T>, not on IEnumerable<T>
    .ForEach(x => ...<do something>...); // Action<T>

ReadLines is a deferred execution method that returns an IEnumerable<string>, so the memory profile stays constant up to the point where the query is materialized for the ForEach call.

Steamy Tom · Apr 11, 2019

[)roi(] said:
It can also be done with Linq.

C#:

File.ReadLines(@"filename.extension")
    .Select((x, i) => (e: x, g: i / 500))
    .GroupBy(x => x.g, x => x.e)
    .Select(x => x.ToList())
    .ToList() // ForEach is defined on List<T>, not on IEnumerable<T>
    .ForEach(x => ...<do something>...); // Action<T>

ReadLines is a deferred execution method that returns an IEnumerable<string>, so the memory profile will also be constant.

People like you deserve a slow horrible death

[)roi(] · Apr 11, 2019

Steamy Tom said:
People like you deserve a slow horrible death

Why?

Spacerat · Apr 11, 2019

[)roi(] said:
It can also be done with Linq.

C#:

File.ReadLines(@"path.filename.extension")
    .Select((x, i) => (e: x, g: i / 500)) // create a tuple of (e: element, g: group id)
    .GroupBy(x => x.g, x => x.e) // group by group id, return only the element (x.e)
    .Select(x => x.ToList())
    .ToList() // ForEach is defined on List<T>, not on IEnumerable<T>
    .ForEach(x => ...<do something>...); // Action<T>

ReadLines is a deferred execution method that returns an IEnumerable<string>, so the memory profile will also be constant.

I like this, but aren't all the lines read into memory in order to do the group by?
One nice thing about this is that you can parallelise the writing to the files in the Action<T>.

Steamy Tom · Apr 11, 2019

[)roi(] said:
Why?

You know what you have done and the pain you have caused to others

[)roi(] · Apr 11, 2019

Spacerat said:
I like this, but aren't all the lines read into memory in order to do the group by?
One nice thing about this is that you can parallelise the writing to the files in the Action<T>.

GroupBy is also deferred (lazy), so the memory only builds up once the ForEach call starts enumerating. But you are correct: at that point the entire file gets loaded into memory. To work around this, one could do something similar to your code: remove GroupBy and instead use x.g (the group id) as a file suffix, appending each line to the relevant output file.

C#:

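var numbered = File.ReadLines(@"path/filename.extension")
    .Select((line, i) => (e: line, g: i / 500)); // still lazy: nothing has been read yet

// A plain foreach keeps it streaming; ForEach only exists on List<T>
foreach (var x in numbered) {
    using (var sw = File.AppendText($"path/outfilename{x.g}.extension")) {
        sw.WriteLine(x.e);
    }
}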

zippy · Apr 13, 2019

This thread is an excellent example of how to overthink and over-engineer a problem that was probably solved back in the 70’s or early 80’s.

Edit: Google “unix split command on windows”
