Fastest and Smallest data format

foozball3000

We're working on software that has to work with very large datasets, either over a network or locally. I'm thinking of using a SQL dataset (one locally, and the large one remotely), but I'm not sure if this is the best way. Excel 2007 uses XML for its data... and it seems fast.

What are the options?
 
XML handling has improved in speed quite dramatically since .NET 1.0 and 1.1, so you might want to work with it. Alternatively, look into SQLite: Firefox uses it, and it doesn't require you to install something like SQL Server 2005 to be able to work on datasets offline from the server.

But to me xml is the way to go. Cache 'em locally and then do some sort of data collision check on records edited.
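One way to do that kind of collision check is a version number on each record: an edit is only saved if the version is still the one the client read. This is just a rough sketch, not anyone's actual implementation. The `records` table, its columns and the `save_if_unchanged` helper are made up for illustration, and SQLite stands in for whatever store you end up using.

```python
import sqlite3

def save_if_unchanged(conn, record_id, new_value, version_read):
    """Update a row only if nobody else changed it since we read it."""
    cur = conn.execute(
        "UPDATE records SET value = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_value, record_id, version_read),
    )
    conn.commit()
    # rowcount is 0 when the version no longer matches, i.e. a collision.
    return cur.rowcount == 1

# Hypothetical table with a version column used only for the check.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT, version INTEGER)"
)
conn.execute("INSERT INTO records VALUES (1, 'original', 1)")

# Two clients both read version 1, then both try to save.
first = save_if_unchanged(conn, 1, "edit from client A", version_read=1)
second = save_if_unchanged(conn, 1, "edit from client B", version_read=1)
print(first, second)
```

The second save is rejected because the first one already bumped the version, so the client can reload the record and decide what to do with its edit.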

Otherwise, since you seem to be in control of both the server and the client, you might consider compressing the dataset before sending it and decompressing/loading it on the client. Serialized datasets are basically glorified XML files anyway, so you should see a significant size reduction when sending them over the network. Try googling "gzip dataset vb.net" (or c#.net, whatever floats your boat).
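To get a feel for how much repetitive XML shrinks, here is a quick sketch in Python rather than VB.NET or C#. The `build_sample_xml` helper and its row layout are invented sample data, not a real DataSet export, so treat the ratio you see as a rough indication only.

```python
import gzip

def build_sample_xml(row_count):
    """Build a repetitive XML payload that loosely resembles a serialized table."""
    rows = "".join(
        f"<row><id>{i}</id><name>Customer {i}</name><city>Sample City</city></row>"
        for i in range(row_count)
    )
    return f"<dataset>{rows}</dataset>".encode("utf-8")

raw = build_sample_xml(10_000)

# Compress on the "server" side, decompress on the "client" side.
packed = gzip.compress(raw)
restored = gzip.decompress(packed)

ratio = len(packed) / len(raw)
print(f"raw={len(raw)} bytes, gzipped={len(packed)} bytes, ratio={ratio:.2f}")
```

The trade-off is CPU time on both ends, so it is worth measuring with your real data before committing to it.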
 
Umm, XML is not the way to go, as it is not the most efficient option, and compressing and decompressing the dataset adds CPU overhead.

Your best bet would be to look at something like Google's Protocol Buffers. It is super awesome, like totally. :P

Why not just use XML?

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

* are simpler
* are 3 to 10 times smaller
* are 20 to 100 times faster
* are less ambiguous
* generate data access classes that are easier to use programmatically

For example, let's say you want to model a person with a name and an email. In XML, you need to do:

<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
</person>

while the corresponding protocol buffer message (in protocol buffer text format) is:

# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "jdoe@example.com"
}

When this message is encoded to the protocol buffer binary format (the text format above is just a convenient human-readable representation for debugging and editing), it would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes if you remove whitespace, and would take around 5,000-10,000 nanoseconds to parse.
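The size figures can be checked by hand-encoding the message. The sketch below does not use the real protobuf library. It assumes `name` is field 1 and `email` is field 2, both plain strings, and that the address is 16 characters long (`jdoe@example.com` is used here as a stand-in). `encode_varint` and `encode_string_field` are helper names made up for this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)

def encode_string_field(field_number, text):
    """Encode one length-delimited (wire type 2) string field."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload

name = "John Doe"
email = "jdoe@example.com"  # assumed 16-character address

# Assumed field numbers: name = 1, email = 2.
binary = encode_string_field(1, name) + encode_string_field(2, email)
xml = f"<person><name>{name}</name><email>{email}</email></person>".encode("utf-8")

print(len(binary), len(xml))  # prints "28 69"
```

Each string costs one tag byte and one length byte on top of its content, which is where the 28 comes from: 2 + 8 for the name and 2 + 16 for the email.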

Also, manipulating a protocol buffer is much easier:

cout << "Name: " << person.name() << endl;
cout << "E-mail: " << person.email() << endl;

Whereas with XML you would have to do something like:

cout << "Name: "
     << person.getElementsByTagName("name")->item(0)->innerText()
     << endl;
cout << "E-mail: "
     << person.getElementsByTagName("email")->item(0)->innerText()
     << endl;

However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

http://code.google.com/apis/protocolbuffers/docs/overview.html
 
There is a saying: you have a data exchange problem, so you decide to use XML. Now you have two problems.

Make sure you are using the right tool for the right job. If you want better advice you'll need to provide more detailed info about your data, its structure and how you intend to use it.
 
I believe he did say that he is going to use large data sets over a network :P
 
It seems like Protocol Buffers can do the job, and will make working over the network a lot more seamless. I'll do some more research on that. Thanks.

I've heard a wise saying: There is no perfect Solution, only perfect Solutions. :)

I think for starters, we'll use SQL on the server, as .NET applications work very easily with it. If I find something better, I can always upgrade/migrate it.

The idea is to have an application that works with the data, and if the client's data is small enough, download it onto your PC and work from there... otherwise it works directly from the server. The server will be in charge of indexing, weighing and testing the data, and eventually archiving the data.
 
You learn something new every day. But isn't that web-based? I assumed the OP was talking about Windows applications. Anyway, I'm always up for learning new stuff.

The thing is, will the client always be connected to the network or would you like them to work from home/offline? I don't see a problem with you querying the SQL server directly if you keep to using stored procedures and filtered data.

It's not a good idea to return 1 million records to the application's grid every time you refresh or do something. If you write it so that the server only responds with x rows each time, you might find that in the end you don't need to worry about compression, protocols or XML.
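Returning x rows at a time could look something like the sketch below. It uses SQLite from Python as a stand-in for the SQL server and a stored procedure, and the `readings` table, its columns and the `fetch_page` helper are hypothetical names. It pages by "id greater than the last one seen", which is one common approach and only an assumption about what would suit your data.

```python
import sqlite3

def fetch_page(conn, after_id, page_size):
    """Return the next page_size rows whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, label FROM readings WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

# Hypothetical table filled with 1,000 sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO readings (id, label) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

# The grid asks for one page, then continues from the last id it was given.
page_one = fetch_page(conn, after_id=0, page_size=50)
page_two = fetch_page(conn, after_id=page_one[-1][0], page_size=50)
print(len(page_one), page_two[0])
```

The client only ever holds one page, so the amount of data per request stays fixed no matter how big the table grows.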

Unless by network you also meant "over the internet", which could slow things down. Anyway, enjoy playing with Google :p
 
It is a data interchange protocol, so no, it's not only for the web; it's used for communication between services such as backend servers.
 
Looks like a superset of JSON to me. JSON is JavaScript Object Notation. It's a lot more efficient than using XML, and has the added benefit of translating transparently into JavaScript objects. So you can very easily manipulate the data in the browser - think Ajax, etc. At least in PHP, it's also super-simple to convert JSON to and from PHP objects.
 
And using JSON will not tie you into using a specific library or coding platform to get things done. It's plain old JavaScript.
 
Protocol buffers are not a subset of JSON.

Protocol buffers are used for communication between interconnected machines, on an RPC call basis.
 
Read something interesting. In .NET 2.0, a DataSet's default is SerializationFormat.Xml, but by changing it to SerializationFormat.Binary you cut almost 3/4 of the file size, making it quite small for access over the network.
 