TOP(1) BY GROUP of very huge (100,000,000+) table

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty{ margin-bottom:0;
}

up vote
8
down vote

favorite

Setup

I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.

CREATE TABLE [data].[OperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [Size] [bigint] NULL,

    [Begin] [datetime2](7) NULL,

    [End] [datetime2](7) NOT NULL,

    [Date]  AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,

    [DataSetCount] [bigint] NULL,

    [Result] [int] NULL,

    [Error] [nvarchar](max) NULL,

    [Status] [int] NULL,

 CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeviceID] ASC,

    [FileSource] ASC,

    [End] ASC

))



CREATE TABLE [model].[SourceDevice](

    [ID] [bigint] IDENTITY(1,1) NOT NULL,

    [Name] [nvarchar](50) NULL,

 CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED 

(

    [ID] ASC

))



ALTER TABLE [data].[OperationData]  WITH CHECK ADD  CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])

REFERENCES [model].[SourceDevice] ([ID])

The table is clustered at around 500 clusters and on a daily base.

partitions

enter image description here

Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.

Index based SELECTs are lightning fast and we had no problem with it.

Problem

I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.

Question

I need to find a way to solve this in a good way and without bringing the DB to the limits.

Effort 1

The first try was obvious GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.

Example query 1

;WITH cte AS

(

   SELECT *,

         ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn

   FROM [data].[OperationData]

)

SELECT *

FROM cte

WHERE rn = 1

Example query 2

SELECT *

FROM [data].[OperationData] AS d 

CROSS APPLY 

(

   SELECT TOP 1 *

   FROM [data].[OperationData] 

   WHERE [SourceDeciveID] = d.[SourceDeciveID]

   ORDER BY [End] DESC

) AS ds

FAILED!

Effort 2

I created a help table to always hold a reference to TOP row.

CREATE TABLE [data].[LastOperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [End] [datetime2](7) NOT NULL,

 CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeciveID] ASC

)



ALTER TABLE [data].[LastOperationData]  WITH CHECK ADD  CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])

REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])

To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.

CREATE TRIGGER [data].[OperationData_Last]

   ON  [data].[OperationData]

   AFTER INSERT

AS 

BEGIN

    SET NOCOUNT ON;



    MERGE [data].[LastOperationData] AS [target]

    USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])  

    ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])



    WHEN MATCHED AND [target].[End] < [source].[End] THEN

        UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]



    WHEN NOT MATCHED THEN  

        INSERT ([SourceDeciveID], [FileSource], [End])  

        VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);



END

The problem here is, that it also has a very huge IO impact and I don't know why.

As you can see here in the query plan it also executes a scan over the whole [OperationData] table.

It has a huge overall impact on my DB.
stats

FAILED!

edited Nov 28 at 21:35

asked Nov 28 at 10:13

Steffen Mangold

397113

2

In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 at 10:19

Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 at 10:23

Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 at 10:27

Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 at 10:30

@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 at 10:32

|
show 2 more comments

up vote
8
down vote

favorite

Setup

I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.

CREATE TABLE [data].[OperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [Size] [bigint] NULL,

    [Begin] [datetime2](7) NULL,

    [End] [datetime2](7) NOT NULL,

    [Date]  AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,

    [DataSetCount] [bigint] NULL,

    [Result] [int] NULL,

    [Error] [nvarchar](max) NULL,

    [Status] [int] NULL,

 CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeviceID] ASC,

    [FileSource] ASC,

    [End] ASC

))



CREATE TABLE [model].[SourceDevice](

    [ID] [bigint] IDENTITY(1,1) NOT NULL,

    [Name] [nvarchar](50) NULL,

 CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED 

(

    [ID] ASC

))



ALTER TABLE [data].[OperationData]  WITH CHECK ADD  CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])

REFERENCES [model].[SourceDevice] ([ID])

The table is clustered at around 500 clusters and on a daily base.

partitions

enter image description here

Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.

Index based SELECTs are lightning fast and we had no problem with it.

Problem

I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.

Question

I need to find a way to solve this in a good way and without bringing the DB to the limits.

Effort 1

Example query 1

;WITH cte AS

(

   SELECT *,

         ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn

   FROM [data].[OperationData]

)

SELECT *

FROM cte

WHERE rn = 1

Example query 2

SELECT *

FROM [data].[OperationData] AS d 

CROSS APPLY 

(

   SELECT TOP 1 *

   FROM [data].[OperationData] 

   WHERE [SourceDeciveID] = d.[SourceDeciveID]

   ORDER BY [End] DESC

) AS ds

FAILED!

Effort 2

I created a help table to always hold a reference to TOP row.

CREATE TABLE [data].[LastOperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [End] [datetime2](7) NOT NULL,

 CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeciveID] ASC

)



ALTER TABLE [data].[LastOperationData]  WITH CHECK ADD  CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])

REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])

To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.

CREATE TRIGGER [data].[OperationData_Last]

   ON  [data].[OperationData]

   AFTER INSERT

AS 

BEGIN

    SET NOCOUNT ON;



    MERGE [data].[LastOperationData] AS [target]

    USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])  

    ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])



    WHEN MATCHED AND [target].[End] < [source].[End] THEN

        UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]



    WHEN NOT MATCHED THEN  

        INSERT ([SourceDeciveID], [FileSource], [End])  

        VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);



END

The problem here is, that it also has a very huge IO impact and I don't know why.

As you can see here in the query plan it also executes a scan over the whole [OperationData] table.

It has a huge overall impact on my DB.
stats

FAILED!

edited Nov 28 at 21:35

asked Nov 28 at 10:13

Steffen Mangold

397113

2

In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 at 10:19

Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 at 10:23

Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 at 10:27

Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 at 10:30

@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 at 10:32

|
show 2 more comments

up vote
8
down vote

favorite

Setup

I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.

CREATE TABLE [data].[OperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [Size] [bigint] NULL,

    [Begin] [datetime2](7) NULL,

    [End] [datetime2](7) NOT NULL,

    [Date]  AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,

    [DataSetCount] [bigint] NULL,

    [Result] [int] NULL,

    [Error] [nvarchar](max) NULL,

    [Status] [int] NULL,

 CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeviceID] ASC,

    [FileSource] ASC,

    [End] ASC

))



CREATE TABLE [model].[SourceDevice](

    [ID] [bigint] IDENTITY(1,1) NOT NULL,

    [Name] [nvarchar](50) NULL,

 CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED 

(

    [ID] ASC

))



ALTER TABLE [data].[OperationData]  WITH CHECK ADD  CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])

REFERENCES [model].[SourceDevice] ([ID])

The table is clustered at around 500 clusters and on a daily base.

partitions

enter image description here

Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.

Index based SELECTs are lightning fast and we had no problem with it.

Problem

I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.

Question

I need to find a way to solve this in a good way and without bringing the DB to the limits.

Effort 1

Example query 1

;WITH cte AS

(

   SELECT *,

         ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn

   FROM [data].[OperationData]

)

SELECT *

FROM cte

WHERE rn = 1

Example query 2

SELECT *

FROM [data].[OperationData] AS d 

CROSS APPLY 

(

   SELECT TOP 1 *

   FROM [data].[OperationData] 

   WHERE [SourceDeciveID] = d.[SourceDeciveID]

   ORDER BY [End] DESC

) AS ds

FAILED!

Effort 2

I created a help table to always hold a reference to TOP row.

CREATE TABLE [data].[LastOperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [End] [datetime2](7) NOT NULL,

 CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeciveID] ASC

)



ALTER TABLE [data].[LastOperationData]  WITH CHECK ADD  CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])

REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])

To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.

CREATE TRIGGER [data].[OperationData_Last]

   ON  [data].[OperationData]

   AFTER INSERT

AS 

BEGIN

    SET NOCOUNT ON;



    MERGE [data].[LastOperationData] AS [target]

    USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])  

    ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])



    WHEN MATCHED AND [target].[End] < [source].[End] THEN

        UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]



    WHEN NOT MATCHED THEN  

        INSERT ([SourceDeciveID], [FileSource], [End])  

        VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);



END

The problem here is, that it also has a very huge IO impact and I don't know why.

As you can see here in the query plan it also executes a scan over the whole [OperationData] table.

It has a huge overall impact on my DB.
stats

FAILED!

edited Nov 28 at 21:35

asked Nov 28 at 10:13

Steffen Mangold

397113

Setup

I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.

CREATE TABLE [data].[OperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [Size] [bigint] NULL,

    [Begin] [datetime2](7) NULL,

    [End] [datetime2](7) NOT NULL,

    [Date]  AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,

    [DataSetCount] [bigint] NULL,

    [Result] [int] NULL,

    [Error] [nvarchar](max) NULL,

    [Status] [int] NULL,

 CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeviceID] ASC,

    [FileSource] ASC,

    [End] ASC

))



CREATE TABLE [model].[SourceDevice](

    [ID] [bigint] IDENTITY(1,1) NOT NULL,

    [Name] [nvarchar](50) NULL,

 CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED 

(

    [ID] ASC

))



ALTER TABLE [data].[OperationData]  WITH CHECK ADD  CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])

REFERENCES [model].[SourceDevice] ([ID])

The table is clustered at around 500 clusters and on a daily base.

partitions

enter image description here

Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.

Index based SELECTs are lightning fast and we had no problem with it.

Problem

I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.

Question

I need to find a way to solve this in a good way and without bringing the DB to the limits.

Effort 1

Example query 1

;WITH cte AS

(

   SELECT *,

         ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn

   FROM [data].[OperationData]

)

SELECT *

FROM cte

WHERE rn = 1

Example query 2

SELECT *

FROM [data].[OperationData] AS d 

CROSS APPLY 

(

   SELECT TOP 1 *

   FROM [data].[OperationData] 

   WHERE [SourceDeciveID] = d.[SourceDeciveID]

   ORDER BY [End] DESC

) AS ds

FAILED!

Effort 2

I created a help table to always hold a reference to TOP row.

CREATE TABLE [data].[LastOperationData](

    [SourceDeciveID] [bigint] NOT NULL,

    [FileSource] [nvarchar](256) NOT NULL,

    [End] [datetime2](7) NOT NULL,

 CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED 

(

    [SourceDeciveID] ASC

)



ALTER TABLE [data].[LastOperationData]  WITH CHECK ADD  CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])

REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])

To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.

CREATE TRIGGER [data].[OperationData_Last]

   ON  [data].[OperationData]

   AFTER INSERT

AS 

BEGIN

    SET NOCOUNT ON;



    MERGE [data].[LastOperationData] AS [target]

    USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])  

    ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])



    WHEN MATCHED AND [target].[End] < [source].[End] THEN

        UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]



    WHEN NOT MATCHED THEN  

        INSERT ([SourceDeciveID], [FileSource], [End])  

        VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);



END

The problem here is, that it also has a very huge IO impact and I don't know why.

As you can see here in the query plan it also executes a scan over the whole [OperationData] table.

It has a huge overall impact on my DB.
stats

FAILED!

sql-server index partitioning azure-sql-database

edited Nov 28 at 21:35

asked Nov 28 at 10:13

Steffen Mangold

397113

edited Nov 28 at 21:35

asked Nov 28 at 10:13

Steffen Mangold

397113

edited Nov 28 at 21:35

asked Nov 28 at 10:13

Steffen Mangold

397113

asked Nov 28 at 10:13

Steffen Mangold

397113

asked Nov 28 at 10:13

Steffen Mangold

397113

2

In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 at 10:19

Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 at 10:23

Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 at 10:27

Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 at 10:30

@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 at 10:32

|
show 2 more comments

2

In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 at 10:19

Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 at 10:23

Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 at 10:27

Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 at 10:30

@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 at 10:32

In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 at 10:19

Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 at 10:23

Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 at 10:27

Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 at 10:30

@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 at 10:32

|
show 2 more comments

1 Answer
1

active

oldest

votes

up vote
9
down vote

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.

SELECT d.*

FROM dbo.Sources s

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d

    WHERE d.SourceID = s.SourceID

    ORDER BY d.[End] DESC) d;

If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.

SELECT d.*

FROM dbo.Sources s -- Small table

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d -- Big table quick seeks

    WHERE d.SourceID = s.SourceID

    AND d.Source = s.Source

    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination

    ORDER BY d.[End] DESC) d;

edited Nov 28 at 21:04

answered Nov 28 at 10:38

Rob Farley

13.4k12447

The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 at 15:37

1

@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 at 15:38

I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 at 21:25

@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 at 21:30

With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 at 3:46

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f223631%2ftop1-by-group-of-very-huge-100-000-000-table%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
9
down vote

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.

SELECT d.*

FROM dbo.Sources s

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d

    WHERE d.SourceID = s.SourceID

    ORDER BY d.[End] DESC) d;

If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.

SELECT d.*

FROM dbo.Sources s -- Small table

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d -- Big table quick seeks

    WHERE d.SourceID = s.SourceID

    AND d.Source = s.Source

    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination

    ORDER BY d.[End] DESC) d;

edited Nov 28 at 21:04

answered Nov 28 at 10:38

Rob Farley

13.4k12447

The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 at 15:37

1

@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 at 15:38

I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 at 21:25

@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 at 21:30

With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 at 3:46

add a comment |

up vote
9
down vote

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.

SELECT d.*

FROM dbo.Sources s

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d

    WHERE d.SourceID = s.SourceID

    ORDER BY d.[End] DESC) d;

If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.

SELECT d.*

FROM dbo.Sources s -- Small table

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d -- Big table quick seeks

    WHERE d.SourceID = s.SourceID

    AND d.Source = s.Source

    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination

    ORDER BY d.[End] DESC) d;

edited Nov 28 at 21:04

answered Nov 28 at 10:38

Rob Farley

13.4k12447

The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 at 15:37

1

@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 at 15:38

I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 at 21:25

@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 at 21:30

With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 at 3:46

add a comment |

up vote
9
down vote

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.

SELECT d.*

FROM dbo.Sources s

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d

    WHERE d.SourceID = s.SourceID

    ORDER BY d.[End] DESC) d;

If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.

SELECT d.*

FROM dbo.Sources s -- Small table

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d -- Big table quick seeks

    WHERE d.SourceID = s.SourceID

    AND d.Source = s.Source

    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination

    ORDER BY d.[End] DESC) d;

edited Nov 28 at 21:04

answered Nov 28 at 10:38

Rob Farley

13.4k12447

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.

SELECT d.*

FROM dbo.Sources s

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d

    WHERE d.SourceID = s.SourceID

    ORDER BY d.[End] DESC) d;

If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.

SELECT d.*

FROM dbo.Sources s -- Small table

OUTER APPLY (SELECT TOP (1) *

    FROM data.OperationData d -- Big table quick seeks

    WHERE d.SourceID = s.SourceID

    AND d.Source = s.Source

    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination

    ORDER BY d.[End] DESC) d;

edited Nov 28 at 21:04

answered Nov 28 at 10:38

Rob Farley

13.4k12447

edited Nov 28 at 21:04

answered Nov 28 at 10:38

Rob Farley

13.4k12447

answered Nov 28 at 10:38

Rob Farley

13.4k12447

answered Nov 28 at 10:38

Rob Farley

13.4k12447

The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 at 15:37

1

@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 at 15:38

I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 at 21:25

@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 at 21:30

With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 at 3:46

add a comment |

The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 at 15:37

1

@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 at 15:38

I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 at 21:25

@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 at 21:30

With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 at 3:46

The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 at 15:37

@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 at 15:38

I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 at 21:25

@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 at 21:30

With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 at 3:46

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here