Programming/Logic quandary

Solarion

I think I need someone with a more Einstein way of thinking here.

Problem: I have a large text file with x amount of thousands of rows. Management wants this text file split into smaller text files of 500 rows. The last file might only have 50 or 80 or even 499 rows depending on the remainder of what is left in the main file.

This is the basic block of logic I have so far.

C#:
public static void ReadFile(string file)
        {
            int y = 0;

            var lines = File.ReadAllLines(file);

            foreach (var line in lines)
            {
                y++;

                if(y == 500)
                {
                    y = 0; //each time this reaches 500 write to a new file and start the counter again.
                }
            }

        }

I'll bet you can see the problem already. What happens when I only have, say, 100 rows or even 499 rows left? How is my logic going to compensate for that remainder of rows in the file?

Perhaps I'm going about this all wrong in splitting up a text file. At this point I'm dead in the water.
 
Think I've figured it out. Will post the new code block a bit later on. It's working now anyway just needed a smoke and some coffee :thumbsup:
 
Can't you get the number of lines and split it line by line?
If lines left > 0, keep going; check the line number mod 500 and use that to decide when a new file is needed? Or count in reverse: as long as the count is less than the total number of lines, keep going; if it is divisible by 500 with no remainder, start a new file, and keep printing lines until you hit 0 lines left?

This seems like a second year question at most since it looks like you can read the entire file into memory. :p
 
Check my comments in your code

Code:
public static void ReadFile(string file)
        {
            int y = 0;

            var lines = File.ReadAllLines(file);

            foreach (var line in lines)
            {
                y++;

                // Add line to a list

                if(y == 500)
                {
                    // Write list to file
                    // Set list to empty again
                    y = 0; //each time this reaches 500 write to a new file and start the counter again.
                }
            }

            if (y > 0) {
                // There are lines left over
                // Write list to file
            }

        }
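
For anyone who wants the comments above filled in, here is a minimal sketch of the same idea. The class name `FileSplitter`, the `outputPattern` parameter and the `Output.{0}.txt` style of naming are assumptions made for illustration; they are not part of the original code.

```csharp
using System.Collections.Generic;
using System.IO;

public static class FileSplitter
{
    // Splits 'file' into files of 'chunkSize' lines and returns how many files were written.
    // 'outputPattern' is a hypothetical naming scheme, e.g. "Output.{0}.txt".
    public static int SplitFile(string file, string outputPattern, int chunkSize = 500)
    {
        var buffer = new List<string>();
        int fileNo = 0;

        foreach (var line in File.ReadAllLines(file))
        {
            // Add line to the list
            buffer.Add(line);

            if (buffer.Count == chunkSize)
            {
                // Full chunk: write the list to a file and empty it again
                File.WriteAllLines(string.Format(outputPattern, ++fileNo), buffer);
                buffer.Clear();
            }
        }

        if (buffer.Count > 0)
        {
            // There are lines left over after the last full chunk
            File.WriteAllLines(string.Format(outputPattern, ++fileNo), buffer);
        }

        return fileNo;
    }
}
```

Called as `FileSplitter.SplitFile("Input.txt", "Output.{0}.txt")`, this would write `Output.1.txt`, `Output.2.txt` and so on, with the remainder in the last file. The list count doubles as the counter, so the separate `y` variable is not needed in this version.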
 
Thanks Veroland that looks pretty awesome I'll give that a try :thumbsup::thumbsup:

That works beautifully :)

This is what I was ultimately missing. Think my brain is definitely fried for today.

C#:
if (y > 0) {
                // There are lines left over
                // Write list to file
            }
 
Cool, most of the time simpler is better

Yeah, I was way overthinking it. I actually had a similar check inside the loop and then needed another variable and and and it became a bit much. The check outside of the loop was what did it.
 
Another option you could do without writing code is the Linux "split -l" command: https://www.computerhope.com/unix/usplit.htm

Easiest way to get it on Windows is probably by installing git bash - https://git-scm.com/downloads or any other interesting way to get access to bash commands.

Many languages also have a split command for arrays, but I am not sure if C# is one of them, without resorting to LINQ.
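
On the "split command for arrays" point: newer .NET versions (6 and later) do ship `Enumerable.Chunk`, although it is part of LINQ. A rough sketch of how it could be used here follows; the class name `ChunkSplitter` and the `outputPattern` parameter are made up for the example, and it will not compile on older frameworks.

```csharp
using System.IO;
using System.Linq;

public static class ChunkSplitter
{
    // Returns the number of files written.
    // 'outputPattern' is a hypothetical naming scheme, e.g. "Output.{0}.txt".
    public static int SplitFile(string file, string outputPattern, int chunkSize = 500)
    {
        int fileNo = 0;

        // ReadLines streams the input; Chunk hands back arrays of up to chunkSize lines,
        // with the last array holding whatever remainder is left.
        foreach (string[] chunk in File.ReadLines(file).Chunk(chunkSize))
        {
            File.WriteAllLines(string.Format(outputPattern, ++fileNo), chunk);
        }

        return fileNo;
    }
}
```

Only one chunk is held in memory at a time here, since both `ReadLines` and `Chunk` are evaluated lazily.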
 
The number of lines was known up front, right?
You could also have tried a normal for loop (i.e. for (int i = 0; i < lines.Length; i++)) and then used modulus (i % 500) to determine when to switch to a new file.

Glad you got sorted :)
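
The for loop and modulus idea above could look roughly like this. It is only a sketch: the class name `ModuloSplitter` and the `outputPattern` parameter are assumptions for the example, and it still reads the whole file into memory first.

```csharp
using System.IO;

public static class ModuloSplitter
{
    // Returns the number of files written.
    // 'outputPattern' is a hypothetical naming scheme, e.g. "Output.{0}.txt".
    public static int SplitFile(string file, string outputPattern, int chunkSize = 500)
    {
        string[] lines = File.ReadAllLines(file);
        StreamWriter writer = null;
        int fileNo = 0;

        try
        {
            for (int i = 0; i < lines.Length; i++)
            {
                if (i % chunkSize == 0)
                {
                    // i is 0, 500, 1000, ...: close the current file and switch to a new one
                    writer?.Dispose();
                    writer = File.CreateText(string.Format(outputPattern, ++fileNo));
                }

                writer.WriteLine(lines[i]);
            }
        }
        finally
        {
            // Closes the last file, whether it is full or holds only the remainder
            writer?.Dispose();
        }

        return fileNo;
    }
}
```

Because a new file is opened at the start of each block rather than written at the end of one, the leftover rows need no special case.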

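or ( (i & 0x1ff) == 0 ), and behold the massive speed increase. ;) (The mask trick only works for powers of two, so that would be a new file every 512 rows rather than 500.)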
 
The examples above are brute force and have a space complexity of O(n), i.e. they load the entire input file into memory. Not good. The code below reads each line from the input file and then writes it to the appropriate output file. Thus O(1) in terms of space requirements == good.

C#:
    // Requires: using System.IO;
    static void Main(string[] args)
    {
      int outputFileSize = 500;
      string inputFilename = @"Input.txt";
      string outputFilePattern = @"Output.{0}.txt";

      using (StreamReader reader = File.OpenText(inputFilename))
      {
        StreamWriter outputFile = null;
        int outputFileNo = 0;
        int lines = 0;

        try
        {
          string line;
          while ((line = reader.ReadLine()) != null) // null means end of input
          {
            if (lines % outputFileSize == 0) // Detect linecount to start writing to new file
            {
              // Close previous file (Dispose also closes the writer)
              outputFile?.Dispose();

              // Open new file
              outputFileNo++;
              string outputFilename = string.Format(outputFilePattern, outputFileNo);
              // Note: AppendText adds to a file that already exists, so clear old output before a rerun
              outputFile = File.AppendText(outputFilename);
            }

            // Write to file
            outputFile.WriteLine(line);
            lines++;
          }
        }
        finally
        {
          // Cleanup: close the last (possibly partial) file
          outputFile?.Dispose();
        }
      }
    }
 
It can also be done with LINQ.
C#:
File.ReadLines(@"path/filename.extension")
    .Select((x, i) => (e: x, g: i / 500)) // create a tuple of (e: element, g: group id)
    .GroupBy(x => x.g, x => x.e) // group by group id, return only the element (x.e)
    .Select(x => x.ToList())
    .ToList() // ForEach is a List<T> method; IEnumerable<T> has no built-in ForEach
    .ForEach(x => ...<do something>...) // Action<T>;

ReadLines is a deferred-execution method that returns an IEnumerable<string>, so the memory profile will also be constant, up to the point of the ForEach method call.
 
People like you deserve a slow horrible death
 
I like this, but are all the lines not read into memory in order to do the group by?
One nice thing about this is that you can parallelize the writing to the files in the Action<T>
 
GroupBy is also deferred (lazy), meaning the memory build-up only occurs in the ForEach method call; but you are correct that at that point the entire file would be loaded into memory. To work around this one could do something similar to your code: remove GroupBy, and instead use x.g (the group id) as a file suffix to append each line to the relevant output file.

C#:
// A plain foreach keeps this streaming; calling ToList() first would load the whole file again
foreach (var item in File.ReadLines(@"path/filename.extension")
    .Select((line, i) => (e: line, g: i / 500)))
{
    // Reopens the output file for every line: simple, but slow on large inputs
    using (var sw = File.AppendText($"path/outfilename{item.g}.extension")) {
      sw.WriteLine(item.e);
    }
}
 
This thread is an excellent example of how to overthink and over-engineer a problem which was probably solved back in the 70’s or early 80’s :)

Edit: Google “unix split command on windows”
 