Getting random data out of SQL Server

SQL Server is excellent for storing data and passing it back in nice ordered blocks, but what happens if you just want to take a random selection of data for example a random list of customers to test a new marketing campaign on.

Since SQL2005 there has been a clause called TABLESAMPLE that will allow you to get some random data from a table either as a percentage of the table or a set number of rows.

Select * from mytableTABLESAMPLE(10 PERCENT)

Select * from mytableTABLESAMPLE(100 ROWS)

You can also specify a REPEATABLE option so that if no updates, inserts or deletes have been made to the database inbetween queries the same results will be returned.

Select * from mytableTABLESAMPLE(10 PERCENT) REPEATABLE(1)

The (1) is an arbitrary value, as long as this stays the same between queries and no other changes have been made you will see the same results.

Unfortunately due to the way this clause works under the hood the results may not be truly random as it uses the way the data is stored internally in pages to sample information rather than taking the database rows as a whole and returning the sample from there.

This also has some limitations in that it cannot be used in Views, or Rowsets.

Another more flexible method to get random data out from a sql table is to simply sort by NEWID(), this works by temporarily creating a Guid for each row and sorting by it, and is a very effective way especially when combined with a TOP command to extract a random set of rows. A Guid or Globally Unique Identifier is a computer generated unique reference number that is 128bits long and should be unique as it can be generated with 2^122 unique values. A Guid usually takes the form {21EC2020-3AEA-4069-A2DD-08002B30309D}.

Select TOP 50 * from mytable order by NEWID()

This will return a random 50 rows from the table, but if the query is run a second time it will return a different 50 rows, although it is possible that some of the rows from the first query may appear in the second.

This method has the advantage that the order by clause can be included in almost existing query or view but care has to be taken about the order it appears in if there is already an existing ORDER BY clause.

Select top 50 * from mytable order by name,NEWID()

This will probably not return the results you are looking for as the query will perform a fixed sort before the random part you are looking for.

The query should be rewritten as

Select top 50 * from mytable order by NEWID(),name

This will return a random sample sorted in the order defined after the NEWID().

data t-sql

My occasional musings of a technical nature

Getting random data out of SQL Server

Leave a Reply Cancel reply