Fastest and Smallest data format

foozball3000 · Jan 28, 2009

We're working on software that has to work with very large datasets, either over a network or locally. I'm thinking of using a SQL dataset (one locally, and the large one remotely), but I'm not sure this is the best way. Excel 2007 uses XML for its data... and it seems fast.

What are the options?

guest2013-1 · Jan 28, 2009

foozball3000 said:
We're working on software that has to work with very large datasets, either over a network or locally. I'm thinking of using a SQL dataset (one locally, and the large one remotely), but I'm not sure this is the best way. Excel 2007 uses XML for its data... and it seems fast.

What are the options?

XML has improved in speed quite dramatically since .NET 1.0 and 1.1, so you might want to work with it. Alternatively, look into SQLite: Firefox uses it, and it doesn't require you to install something like SQL Server 2005 to be able to work on datasets offline from the server.

But to me XML is the way to go. Cache 'em locally and then do some sort of collision check on the records that were edited.
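One way such a collision check could look, as a rough sketch only: keep a version number on every cached row and let an edit through only if the version is still the one that was read. This is plain optimistic concurrency, not anything specific to .NET DataSets, and the table name `records`, its columns, and both helper functions are made up for illustration. SQLite (via Python's standard library) stands in for whatever local store is used.

```python
import sqlite3

def open_cache(path=":memory:"):
    """Create a local cache table; 'version' counts the edits made to each row."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    return conn

def save_if_unchanged(conn, record_id, new_payload, version_seen):
    """Apply the edit only if the row still has the version we read earlier.

    Returns True when the edit was saved, False when another edit got in first.
    """
    cur = conn.execute(
        "UPDATE records SET payload = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_payload, record_id, version_seen),
    )
    conn.commit()
    return cur.rowcount == 1

conn = open_cache()
conn.execute("INSERT INTO records VALUES (1, 'original', 1)")
print(save_if_unchanged(conn, 1, "edit from A", 1))  # True: version 1 was current
print(save_if_unchanged(conn, 1, "edit from B", 1))  # False: row is now at version 2
```

When the function returns False, the application still has to decide what to do (reload, merge, or ask the user); the sketch only detects the collision.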

Otherwise, it seems you're in control of both the server and client, so you might consider compressing the dataset before sending it, then decompressing and loading it on the client. Datasets are basically glorified XML files anyway, so you should see a significant size reduction when sending them over the network. Try googling "gzip dataset vb.net" (or c#.net, whatever floats your boat).
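To give a feel for the compress-then-send idea, here is a small sketch in Python rather than VB.NET or C#, so treat it as an illustration of the principle and not of the DataSet API. The `dataset`/`record` element names, the sample rows, and the `pack`/`unpack` helpers are all invented for the example.

```python
import gzip
import xml.etree.ElementTree as ET

def rows_to_xml(rows):
    """Serialise a list of dicts into one XML document, returned as bytes."""
    root = ET.Element("dataset")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = str(value)
    return ET.tostring(root, encoding="utf-8")

def pack(rows):
    """Server side: XML-serialise the rows, then gzip them for the wire."""
    return gzip.compress(rows_to_xml(rows))

def unpack(blob):
    """Client side: gunzip and parse; every value comes back as a string."""
    root = ET.fromstring(gzip.decompress(blob))
    return [{field.tag: field.text for field in rec} for rec in root]

# Hypothetical sample data, deliberately repetitive like most tabular data.
rows = [{"id": i, "name": f"customer {i}", "city": "Cape Town"} for i in range(1000)]
raw = rows_to_xml(rows)
packed = pack(rows)
print(len(raw), len(packed))  # raw XML size vs gzipped size, in bytes
```

Repetitive tag names are exactly what gzip is good at, which is why XML datasets tend to shrink a lot. The actual ratio depends on the data, and the compression does cost CPU time on both ends.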

semaphore · Jan 28, 2009

AcidRaZor said:
XML has improved in speed quite dramatically since .NET 1.0 and 1.1, so you might want to work with it. Alternatively, look into SQLite: Firefox uses it, and it doesn't require you to install something like SQL Server 2005 to be able to work on datasets offline from the server.

But to me XML is the way to go. Cache 'em locally and then do some sort of collision check on the records that were edited.

Otherwise, it seems you're in control of both the server and client, so you might consider compressing the dataset before sending it, then decompressing and loading it on the client. Datasets are basically glorified XML files anyway, so you should see a significant size reduction when sending them over the network. Try googling "gzip dataset vb.net" (or c#.net, whatever floats your boat).

Umm, XML is not the way to go, as it is not the most efficient option, and compressing and decompressing the dataset adds CPU overhead.

Your best bet would be looking at something like Google's Protocol Buffers. It is super awesome like totally.

Why not just use XML?

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

* are simpler
* are 3 to 10 times smaller
* are 20 to 100 times faster
* are less ambiguous
* generate data access classes that are easier to use programmatically

For example, let's say you want to model a person with a name and an email. In XML, you need to do:

<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
</person>

while the corresponding protocol buffer message (in protocol buffer text format) is:

# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "jdoe@example.com"
}

When this message is encoded to the protocol buffer binary format (the text format above is just a convenient human-readable representation for debugging and editing), it would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes if you remove whitespace, and would take around 5,000-10,000 nanoseconds to parse.

Also, manipulating a protocol buffer is much easier:

cout << "Name: " << person.name() << endl;
cout << "E-mail: " << person.email() << endl;

Whereas with XML you would have to do something like:

cout << "Name: "
     << person.getElementsByTagName("name")->item(0)->innerText()
     << endl;
cout << "E-mail: "
     << person.getElementsByTagName("email")->item(0)->innerText()
     << endl;

However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

http://code.google.com/apis/protocolbuffers/docs/overview.html
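The 28-byte and 69-byte figures in the quoted overview can be checked by hand. The sketch below encodes the message without the protobuf library, using only the wire-format rules for length-delimited fields (a tag byte, a length, then the UTF-8 bytes). Two things are assumptions on my side: that `name` and `email` are field numbers 1 and 2, and that the address is the 16-character `jdoe@example.com`. With those, the numbers come out exactly as quoted.

```python
def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string_field(field_number, text):
    """Length-delimited field: tag varint, length varint, then UTF-8 bytes."""
    data = text.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data

NAME = "John Doe"
EMAIL = "jdoe@example.com"  # assumed address, 16 characters

# Assumed field numbers: name = 1, email = 2.
message = encode_string_field(1, NAME) + encode_string_field(2, EMAIL)
xml_bytes = f"<person><name>{NAME}</name><email>{EMAIL}</email></person>".encode("utf-8")

print(len(message), len(xml_bytes))  # 28 69
```

So for this record the binary form is a bit under half the size of the whitespace-free XML. The "3 to 10 times smaller" claim would need longer tag names or more numeric fields to show up; this only confirms the two byte counts.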

icyrus · Jan 28, 2009

There is a saying: you have a data exchange problem, so you decide to use XML. Now you have two problems.

Make sure you are using the right tool for the right job. If you want better advice you'll need to provide more detailed info about your data, its structure and how you intend to use it.

semaphore · Jan 28, 2009

I believe he did say that he is going to use large datasets over a network.

foozball3000 · Jan 28, 2009

It seems like Protocol Buffers can do the job, and will make working over the network a lot more seamless. I'll do some more research on that. Thanks.

I've heard a wise saying: There is no perfect Solution, only perfect Solutions.

I think for starters, we'll use SQL on the server, as .NET applications work very easily with it. If I find something better, I can always upgrade/migrate it.

The idea is to have an application that works with the data, and if the client's data is small enough, download it onto your PC and work from there... otherwise it works directly from the server. The server will be in charge of indexing, weighing and testing the data, and eventually archiving the data.

guest2013-1 · Jan 28, 2009

semaphore said:
Umm, XML is not the way to go, as it is not the most efficient option, and compressing and decompressing the dataset adds CPU overhead.

Your best bet would be looking at something like Google's Protocol Buffers. It is super awesome like totally.

http://code.google.com/apis/protocolbuffers/docs/overview.html

You learn something new every day. But isn't that web-based? I assumed the OP was talking about Windows applications. Anyway, I'm always up for learning new stuff.

foozball3000 said:
It seems like Protocol Buffers can do the job, and will make working over the network a lot more seamless. I'll do some more research on that. Thanks.

I've heard a wise saying: There is no perfect Solution, only perfect Solutions.

I think for starters, we'll use SQL on the server, as .NET applications work very easily with it. If I find something better, I can always upgrade/migrate it.

The idea is to have an application that works with the data, and if the client's data is small enough, download it onto your PC and work from there... otherwise it works directly from the server. The server will be in charge of indexing, weighing and testing the data, and eventually archiving the data.

The thing is, will the client always be connected to the network or would you like them to work from home/offline? I don't see a problem with you querying the SQL server directly if you keep to using stored procedures and filtered data.

It's not a good idea to return 1 million records to the application's grid every time you refresh or do something. If you write it so that it only responds with x rows each time, you might find that in the end you don't need to worry about compression, protocols or XML.

Unless by network you also meant "over the internet", which could slow things down. Anyway, enjoy playing with Google.
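The "only respond with x rows each time" idea could be sketched like this. It uses Python with an in-memory SQLite database purely so the example runs anywhere; on a real SQL Server the same query shape would live in a stored procedure. The `customers` table, its columns, and `fetch_page` are hypothetical names for the example.

```python
import sqlite3

def fetch_page(conn, after_id=0, page_size=50):
    """Return up to page_size rows whose id is greater than after_id.

    Paging on the key (instead of OFFSET) keeps each query cheap,
    because the server can seek straight to the starting row.
    """
    return conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

# Hypothetical data set standing in for the large server-side table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer {i}") for i in range(1, 1001)],
)

first = fetch_page(conn)
second = fetch_page(conn, after_id=first[-1][0])  # continue after the last id seen
print(first[0], second[0])  # (1, 'customer 1') (51, 'customer 51')
```

The grid then only ever holds one page, and the client asks for the next one by passing the last id it has seen.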

semaphore · Jan 28, 2009

AcidRaZor said:
You learn something new every day. But isn't that web-based? I assumed the OP was talking about Windows applications. Anyway, I'm always up for learning new stuff.

The thing is, will the client always be connected to the network or would you like them to work from home/offline? I don't see a problem with you querying the SQL server directly if you keep to using stored procedures and filtered data.

It's not a good idea to return 1 million records to the application's grid every time you refresh or do something. If you write it so that it only responds with x rows each time, you might find that in the end you don't need to worry about compression, protocols or XML.

Unless by network you also meant "over the internet", which could slow things down. Anyway, enjoy playing with Google.

It is a data interchange protocol, so no, it's not only for the web; it's used for communication between services such as backend servers.

marlinf · Feb 4, 2009

Looks like a superset of JSON to me. JSON is JavaScript Object Notation. It's a lot more efficient than using XML, and has the added benefit of translating transparently into JavaScript objects. So you can very easily manipulate the data in the browser - think Ajax, etc. At least in PHP, it's also super-simple to convert JSON to and from PHP objects.

marlinf · Feb 4, 2009

And using JSON will not tie you into using a specific library or coding platform to get things done. It's plain old JavaScript.
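For a rough idea of the size difference against XML, here is the same tiny record serialised both ways. It is written in Python rather than JavaScript or PHP, just to show that the format is not tied to one platform; the record contents and the `person` element name are placeholders for the example.

```python
import json
import xml.etree.ElementTree as ET

# Placeholder record used only for this comparison.
record = {"name": "John Doe", "email": "jdoe@example.com"}

# Compact JSON: no spaces after the separators.
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# The same record as XML, one child element per field.
root = ET.Element("person")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root)

print(len(as_json), len(as_xml))  # JSON bytes vs XML bytes

# Round trip: parsing the JSON gives back the original dict.
assert json.loads(as_json) == record
```

JSON comes out smaller here mainly because field names are written once per value instead of twice (opening and closing tag). On a record this small the gap is modest, and it is still a text format, so it will not match a binary encoding for size.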

semaphore · Feb 4, 2009

marlinf said:
Looks like a superset of JSON to me. JSON is JavaScript Object Notation. It's a lot more efficient than using XML, and has the added benefit of translating transparently into JavaScript objects. So you can very easily manipulate the data in the browser - think Ajax, etc. At least in PHP, it's also super-simple to convert JSON to and from PHP objects.

Protocol buffers are not a subset of JSON.

Protocol buffers are used for communication between interconnected machines, on an RPC basis.

guest2013-1 · Feb 9, 2009

Read something interesting. In .NET 2.0, a DataSet's RemotingFormat defaults to SerializationFormat.Xml, but by changing it to SerializationFormat.Binary you cut almost 3/4 of the file size, making it quite small for access over the network.
