Need help with MySQL grouping

How many records should we expect in this database with the points?

Main table has about 78k records at the moment, I maybe over-simplified the example in the OP, there are other columns I'm needing to group by also but wanted to make it clear what I wanted.

I'll hack it a bit more to see if I can come up with a solution, if not I'll post a more detailed example to see if someone can help.
 
You could probably make use of the previous and next record and do a calculation using something like DATE_SUB then group on the result.

But how must it know which record to use as the "Base/Starting" point.
I.E
USER1 50 2014-01-16 23:10:03
USER1 22 2014-01-17 00:01:04
USER1 16 2014-01-17 00:59:17

You will end up with
USER1,72
USER1,32

So record 2's points is now shared with record 1 and 3.


Record 3 is < 1 hour compared to record 2 but > than 1 hour compared to record 1.

Yeah, I have a solution which solves Zoidberg's original problem, but will work incorrectly when the times overlap, as you've shown here.
 
Main table has about 78k records at the moment, I maybe over-simplified the example in the OP, there are other columns I'm needing to group by also but wanted to make it clear what I wanted.

I'll hack it a bit more to see if I can come up with a solution, if not I'll post a more detailed example to see if someone can help.

I have a solution (only if there are no overlaps as in SBSP's post), but it involves a cross join. So that's way too impractical on a dataset of that size.
 
Actually this problem is very complex. You're essentially trying to solve the maximum subarray problem (see Introduction to Algorithms by Cormen and Rivest, 3rd edition, chapter 4.1, p. 68) with an added time constraint.

You can't just select an arbitrary starting point, otherwise you could be cheating users out of points claimed within any hourly period, and they will pick up on it fairly quickly.

EDIT: In addition, this problem may be so complex to solve (surely it's not NP-complete, I don't think?) that you'd be better off doing r-combinations and selecting the combinations offering maximum points. :p
 
Last edited:
I got it working for your example, although the bad news is I used temp tables as well as a while loop, and its coded to be inefficient, plus it might have bugs! Also it's written is SQL, badly, but whatever ;P

Code:
CREATE TABLE #t(ID INT IDENTITY(1,1),USERNAME VARCHAR(10),POINTS INT,DATESTMP DATETIME,grp INT,DATESTMPADDED DATETIME)
INSERT INTO #t(USERNAME,POINTS,DATESTMP)
VALUES
('USER1',50,'2014-01-16 23:59:03'),('USER1',22,     '2014-01-17 00:01:04'),('USER1',    16,     '2014-01-17 00:19:17'),
('USER1',    11,     '2014-01-31 15:12:00'),
('USER1' ,   34 ,    '2014-01-31 15:19:34')

CREATE TABLE #t2(ID INT,USERNAME VARCHAR(10),POINTS INT,DATESTMP DATETIME,grp INT,DATESTMPADDED DATETIME)
INSERT INTO #t2
SELECT * FROM #t

UPDATE #t SET DATESTMPADDED = DATEADD(minute,30,DATESTMP)

DECLARE @id int = 1
		,@keeper INT = 1
		,@timestamp_added DATETIME = NULL
		,@timestamp_old DATETIME = NULL
		,@grpId int = 1

WHILE(@keeper < 5)
BEGIN
	SELECT @timestamp_added = DATESTMPADDED FROM #t WHERE #t.ID =@id
	UPDATE #t2 SET grp = @grpId WHERE #t2.DATESTMP < @timestamp_added and grp IS NULL
	SET @id = @@ROWCOUNT + 1
	SET @grpId = @grpId + 1
	SET @keeper = @keeper + 1
END

SELECT * FROM #t
SELECT ID,grp FROM #t2

SELECT USERNAME,SUM(POINTS) _points 
FROM #t2 
GROUP BY USERNAME,grp

drop table #t,#t2

Capture.PNG

BUT it might give you ideas on different approaches?

Edit: also, it might be a total fluke that it works for your example, but not for others. I just kinda went with how I would've approached it?
 
Last edited:
His example is way too simple for the complexity of the problem. Your code will most likely fail on a larger, more complex example.

True, if you read everything I did say that multiple times. Still it might give him ideas. So...
 
Would it simply things if I grouped by days (24 hour periods)?

Im reading through this and Im still not sure if you want to group around a certain time on any day.... or around a certain time on a specific day ?

for the latter I do this

Code:
select CAST(month(datetime) AS VARCHAR(4))as 'Month' ,
   CAST(day(datetime) AS integer)as 'Day', DATENAME(dw,datetime)as 'Day of week',


Obviously you will rather choose the hour or minutes for you case.

The other thing you are trying to do... group by proximity to a time, figure out what your time periods are then you can do this

Code:
group by (cast (hour(datetime)AS VARCHAR(4) > 12:30 and cast (hour(datetime)AS VARCHAR(4) < 1:30 ) as 'around 12')

also group by day first.
 
Last edited:
If you just want to group on relative time bins i.e. time elapsed between rows (the _DIFF_ of prev row and its successor < 1 hour), you cannot do it with a group by. Will never work. You have wrapping from one day to the next, one year to the next etc, Group By is not designed to do this. You will have to use a cursor and iterate, but the starting point is tricky.

If you want to aggregate all user points into absolute hourly time bins then this will work:

select
username,
year,
month,
day,
hour,
sum(points)
from
(
select
username,
points,
extract (year from DATESTAMP) as year,
extract (month from DATESTAMP) as month,
extract (day from DATESTAMP) as day,
extract (hour from DATESTAMP) as hour
from mytable
)
group by
username,
year,
month,
day,
hour

then you use a MERGE to insert/update into your summary table
 
To all the guys producing code, use this as a test case (transaction numbers + times + points):
  1. 10:00 (20 points)
  2. 10:59 (30 points)
  3. 11:01 (10 points)
  4. 11:50 (5 points)
  5. 11:58 (7 points)
  6. 12:02 (20 points)
  7. 12:40 (30 points)
  8. 12:50 (10 points)
  9. 13:20 (20 points)

The following are some options of many timepoint arrangements without overlapping; the timepoints in brackets all fall within 1 relative hour:
  • (10:00, 10:59) & (11:01, 11:50, 11:58) & (12:02, 12:40, 12:50) & (13:20)
  • (10:00, 10:59) & (11:01) & (11:50, 11:58, 12:02, 12:40, 12:50) & (13:20)
  • (10:00, 10:59) & (11:01, 11:50) & (11:58, 12:02, 12:40, 12:50) & (13:20)
  • (10:00) & (10:59, 11:01, 11:50, 11:58) & (12:02) & (12:40, 12:50, 13:20)

You'll have to be able to choose the most optimal combination to give your users the most points possible. Because if they sit down with a pen and paper and calculate otherwise, the shyte will hit the fan. :p

This is a difficult problem. It's not going to be solved by a simple SQL script with some subtraction and group-by statements. I'm leaning more and more towards a solution generated by r-combinations, which will be way too slow for the amount of records you have.
 
Hmm, yep. I keep hitting a brick wall.

I'm thinking it would probably be much easier not doing the time group in the query and rather iterate through on my php backend system. I'll just make sure the query sorts it correctly.
 
To all the guys producing code, use this as a test case (transaction numbers + times + points):
  1. 10:00 (20 points)
  2. 10:59 (30 points)
  3. 11:01 (10 points)
  4. 11:50 (5 points)
  5. 11:58 (7 points)
  6. 12:02 (20 points)
  7. 12:40 (30 points)
  8. 12:50 (10 points)
  9. 13:20 (20 points)

The following are some options of many timepoint arrangements without overlapping; the timepoints in brackets all fall within 1 relative hour:
  • (10:00, 10:59) & (11:01, 11:50, 11:58) & (12:02, 12:40, 12:50) & (13:20)
  • (10:00, 10:59) & (11:01) & (11:50, 11:58, 12:02, 12:40, 12:50) & (13:20)
  • (10:00, 10:59) & (11:01, 11:50) & (11:58, 12:02, 12:40, 12:50) & (13:20)
  • (10:00) & (10:59, 11:01, 11:50, 11:58) & (12:02) & (12:40, 12:50, 13:20)

You'll have to be able to choose the most optimal combination to give your users the most points possible. Because if they sit down with a pen and paper and calculate otherwise, the shyte will hit the fan. :p

This is a difficult problem. It's not going to be solved by a simple SQL script with some subtraction and group-by statements. I'm leaning more and more towards a solution generated by r-combinations, which will be way too slow for the amount of records you have.

But if this is what OP wanted then his initial example is wrong as the 2nr and 3rd row can form it's own bucket.
 
But if this is what OP wanted then his initial example is wrong as the 2nr and 3rd row can form it's own bucket.

Hence why I said OP's example is not a good example to work from.
 
Whatever happened to correlated subqueries and such things?

Problem is, you have to look at the first record and then find all following records that fall within 1 hour. Then look at the next record where the "next record" is the one that comes after the first record's time + 1 hour.

Repeat until finished.

Certainly sounds possible to do in a query (if MySQL supports correlated subqueries - http://dev.mysql.com/doc/refman/5.5/en/correlated-subqueries.html). Definitely possible in a stored procedure. The only downside is that it is not a simple single pass through the data.
 
Last edited:
Problem is, you have to look at the first record and then find all following records that fall within 1 hour. Then look at the next record where the "next record" is the one that comes after the first record's time + 1 hour.

But then you would count some sets of points more than once. You can't let them overlap.
 
select
username,
year,
month,
day,
hour,
sum(points)
from
(
select
username,
points,
extract (year from DATESTAMP) as year,
extract (month from DATESTAMP) as month,
extract (day from DATESTAMP) as day,
extract (hour from DATESTAMP) as hour
from mytable
)
group by
username,
year,
month,
day,
hour


How is that different to this?

select
username,
points,
extract (year from DATESTAMP) as year,
extract (month from DATESTAMP) as month,
extract (day from DATESTAMP) as day,
extract (hour from DATESTAMP) as hour,
sum(points)
from
mytable
group by
username,
extract (year from DATESTAMP),
extract (month from DATESTAMP),
extract (day from DATESTAMP),
extract (hour from DATESTAMP)
 
Top
Sign up to the MyBroadband newsletter
X