Need help with MySQL grouping

DrJohnZoidberg · Feb 25, 2015

Bar0n said:
How many records should we expect in this database with the points?

Main table has about 78k records at the moment, I maybe over-simplified the example in the OP, there are other columns I'm needing to group by also but wanted to make it clear what I wanted.

I'll hack it a bit more to see if I can come up with a solution, if not I'll post a more detailed example to see if someone can help.

Bar0n · Feb 25, 2015

SBSP said:
You could probably make use of the previous and next record and do a calculation using something like DATE_SUB then group on the result.

But how must it know which record to use as the "Base/Starting" point.
I.E
USER1 50 2014-01-16 23:10:03
USER1 22 2014-01-17 00:01:04
USER1 16 2014-01-17 00:59:17

You will end up with
USER1,72
USER1,32

So record 2's points is now shared with record 1 and 3.

Record 3 is < 1 hour compared to record 2 but > than 1 hour compared to record 1.

Yeah, I have a solution which solves Zoidberg's original problem, but will work incorrectly when the times overlap, as you've shown here.

Bar0n · Feb 25, 2015

DrJohnZoidberg said:
Main table has about 78k records at the moment, I maybe over-simplified the example in the OP, there are other columns I'm needing to group by also but wanted to make it clear what I wanted.

I'll hack it a bit more to see if I can come up with a solution, if not I'll post a more detailed example to see if someone can help.

I have a solution (only if there are no overlaps as in SBSP's post), but it involves a cross join. So that's way too impractical on a dataset of that size.

Bar0n · Feb 25, 2015

Actually this problem is very complex. You're essentially trying to solve the maximum subarray problem (see Introduction to Algorithms by Cormen and Rivest, 3rd edition, chapter 4.1, p. 68) with an added time constraint.

You can't just select an arbitrary starting point, otherwise you could be cheating users out of points claimed within any hourly period, and they will pick up on it fairly quickly.

EDIT: In addition, this problem may be so complex to solve (surely it's not NP-complete, I don't think?) that you'd be better off doing r-combinations and selecting the combinations offering maximum points.

DrJohnZoidberg · Feb 25, 2015

Would it simply things if I grouped by days (24 hour periods)?

Bar0n · Feb 25, 2015

DrJohnZoidberg said:
Would it simply things if I grouped by days (24 hour periods)?

Instead of 1 hour periods, 24 hour periods? No. It would simplify things if you did it by clock hours or calendar days.

prod · Feb 25, 2015

I got it working for your example, although the bad news is I used temp tables as well as a while loop, and its coded to be inefficient, plus it might have bugs! Also it's written is SQL, badly, but whatever ;P

Code:

CREATE TABLE #t(ID INT IDENTITY(1,1),USERNAME VARCHAR(10),POINTS INT,DATESTMP DATETIME,grp INT,DATESTMPADDED DATETIME)
INSERT INTO #t(USERNAME,POINTS,DATESTMP)
VALUES
('USER1',50,'2014-01-16 23:59:03'),('USER1',22,     '2014-01-17 00:01:04'),('USER1',    16,     '2014-01-17 00:19:17'),
('USER1',    11,     '2014-01-31 15:12:00'),
('USER1' ,   34 ,    '2014-01-31 15:19:34')

CREATE TABLE #t2(ID INT,USERNAME VARCHAR(10),POINTS INT,DATESTMP DATETIME,grp INT,DATESTMPADDED DATETIME)
INSERT INTO #t2
SELECT * FROM #t

UPDATE #t SET DATESTMPADDED = DATEADD(minute,30,DATESTMP)

DECLARE @id int = 1
		,@keeper INT = 1
		,@timestamp_added DATETIME = NULL
		,@timestamp_old DATETIME = NULL
		,@grpId int = 1

WHILE(@keeper < 5)
BEGIN
	SELECT @timestamp_added = DATESTMPADDED FROM #t WHERE #t.ID =@id
	UPDATE #t2 SET grp = @grpId WHERE #t2.DATESTMP < @timestamp_added and grp IS NULL
	SET @id = @@ROWCOUNT + 1
	SET @grpId = @grpId + 1
	SET @keeper = @keeper + 1
END

SELECT * FROM #t
SELECT ID,grp FROM #t2

SELECT USERNAME,SUM(POINTS) _points 
FROM #t2 
GROUP BY USERNAME,grp

drop table #t,#t2

BUT it might give you ideas on different approaches?

Edit: also, it might be a total fluke that it works for your example, but not for others. I just kinda went with how I would've approached it?

Bar0n · Feb 25, 2015

prOd said:
snip

His example is way too simple for the complexity of the problem. Your code will most likely fail on a larger, more complex example.

prod · Feb 25, 2015

Bar0n said:
His example is way too simple for the complexity of the problem. Your code will most likely fail on a larger, more complex example.

True, if you read everything I did say that multiple times. Still it might give him ideas. So...

Nerfherder · Feb 25, 2015

DrJohnZoidberg said:
Would it simply things if I grouped by days (24 hour periods)?

Im reading through this and Im still not sure if you want to group around a certain time on any day.... or around a certain time on a specific day ?

for the latter I do this

Code:

select CAST(month(datetime) AS VARCHAR(4))as 'Month' ,
   CAST(day(datetime) AS integer)as 'Day', DATENAME(dw,datetime)as 'Day of week',

Obviously you will rather choose the hour or minutes for you case.

The other thing you are trying to do... group by proximity to a time, figure out what your time periods are then you can do this

Code:

group by (cast (hour(datetime)AS VARCHAR(4) > 12:30 and cast (hour(datetime)AS VARCHAR(4) < 1:30 ) as 'around 12')

also group by day first.

DrJohnZoidberg · Feb 25, 2015

Thanks for the ideas! Still working on it

CheekyC · Feb 25, 2015

If you just want to group on relative time bins i.e. time elapsed between rows (the _DIFF_ of prev row and its successor < 1 hour), you cannot do it with a group by. Will never work. You have wrapping from one day to the next, one year to the next etc, Group By is not designed to do this. You will have to use a cursor and iterate, but the starting point is tricky.

If you want to aggregate all user points into absolute hourly time bins then this will work:

select
username,
year,
month,
day,
hour,
sum(points)
from
(
select
username,
points,
extract (year from DATESTAMP) as year,
extract (month from DATESTAMP) as month,
extract (day from DATESTAMP) as day,
extract (hour from DATESTAMP) as hour
from mytable
)
group by
username,
year,
month,
day,
hour

then you use a MERGE to insert/update into your summary table

Bar0n · Feb 25, 2015

To all the guys producing code, use this as a test case (transaction numbers + times + points):

10:00 (20 points)
10:59 (30 points)
11:01 (10 points)
11:50 (5 points)
11:58 (7 points)
12:02 (20 points)
12:40 (30 points)
12:50 (10 points)
13:20 (20 points)

The following are some options of many timepoint arrangements without overlapping; the timepoints in brackets all fall within 1 relative hour:

(10:00, 10:59) & (11:01, 11:50, 11:58) & (12:02, 12:40, 12:50) & (13:20)
(10:00, 10:59) & (11:01) & (11:50, 11:58, 12:02, 12:40, 12:50) & (13:20)
(10:00, 10:59) & (11:01, 11:50) & (11:58, 12:02, 12:40, 12:50) & (13:20)
(10:00) & (10:59, 11:01, 11:50, 11:58) & (12:02) & (12:40, 12:50, 13:20)

You'll have to be able to choose the most optimal combination to give your users the most points possible. Because if they sit down with a pen and paper and calculate otherwise, the shyte will hit the fan.

This is a difficult problem. It's not going to be solved by a simple SQL script with some subtraction and group-by statements. I'm leaning more and more towards a solution generated by r-combinations, which will be way too slow for the amount of records you have.

DrJohnZoidberg · Feb 25, 2015

Hmm, yep. I keep hitting a brick wall.

I'm thinking it would probably be much easier not doing the time group in the query and rather iterate through on my php backend system. I'll just make sure the query sorts it correctly.

Bar0n · Feb 25, 2015

DrJohnZoidberg said:
iterate through on my php backend system

Still not going to work optimally unless you implement r-combinations.

prod · Feb 25, 2015

Bar0n said:
To all the guys producing code, use this as a test case (transaction numbers + times + points):

10:00 (20 points)

10:59 (30 points)

11:01 (10 points)

11:50 (5 points)

11:58 (7 points)

12:02 (20 points)

12:40 (30 points)

12:50 (10 points)

13:20 (20 points)

The following are some options of many timepoint arrangements without overlapping; the timepoints in brackets all fall within 1 relative hour:

(10:00, 10:59) & (11:01, 11:50, 11:58) & (12:02, 12:40, 12:50) & (13:20)

(10:00, 10:59) & (11:01) & (11:50, 11:58, 12:02, 12:40, 12:50) & (13:20)

(10:00, 10:59) & (11:01, 11:50) & (11:58, 12:02, 12:40, 12:50) & (13:20)

(10:00) & (10:59, 11:01, 11:50, 11:58) & (12:02) & (12:40, 12:50, 13:20)

You'll have to be able to choose the most optimal combination to give your users the most points possible. Because if they sit down with a pen and paper and calculate otherwise, the shyte will hit the fan.

This is a difficult problem. It's not going to be solved by a simple SQL script with some subtraction and group-by statements. I'm leaning more and more towards a solution generated by r-combinations, which will be way too slow for the amount of records you have.

But if this is what OP wanted then his initial example is wrong as the 2nr and 3rd row can form it's own bucket.

Bar0n · Feb 25, 2015

prOd said:
But if this is what OP wanted then his initial example is wrong as the 2nr and 3rd row can form it's own bucket.

Hence why I said OP's example is not a good example to work from.

Waansin · Feb 26, 2015

Whatever happened to correlated subqueries and such things?

Problem is, you have to look at the first record and then find all following records that fall within 1 hour. Then look at the next record where the "next record" is the one that comes after the first record's time + 1 hour.

Repeat until finished.

Certainly sounds possible to do in a query (if MySQL supports correlated subqueries - http://dev.mysql.com/doc/refman/5.5/en/correlated-subqueries.html). Definitely possible in a stored procedure. The only downside is that it is not a simple single pass through the data.

Bar0n · Feb 26, 2015

Waansin said:
Problem is, you have to look at the first record and then find all following records that fall within 1 hour. Then look at the next record where the "next record" is the one that comes after the first record's time + 1 hour.

But then you would count some sets of points more than once. You can't let them overlap.

rorz0r · Feb 26, 2015

CheekyC said:
select
username,
year,
month,
day,
hour,
sum(points)
from
(
select
username,
points,
extract (year from DATESTAMP) as year,
extract (month from DATESTAMP) as month,
extract (day from DATESTAMP) as day,
extract (hour from DATESTAMP) as hour
from mytable
)
group by
username,
year,
month,
day,
hour

How is that different to this?

select
username,
points,
extract (year from DATESTAMP) as year,
extract (month from DATESTAMP) as month,
extract (day from DATESTAMP) as day,
extract (hour from DATESTAMP) as hour,
sum(points)
from
mytable
group by
username,
extract (year from DATESTAMP),
extract (month from DATESTAMP),
extract (day from DATESTAMP),
extract (hour from DATESTAMP)

Join the MyBroadband community

Get started

Need help with MySQL grouping

Honorary Master

Executive Member

Executive Member

Executive Member

Honorary Master

Executive Member

Executive Member

Executive Member

Executive Member

Honorary Master

Honorary Master

Senior Member

Executive Member

Honorary Master

Executive Member

Executive Member

Executive Member

Well-Known Member

Executive Member

Executive Member