Note

This tech note is a collection of Qserv prototyping experiments drawn from different locations. The first two sections were originally published in [4] and subsequently [9] (see DocuShare version 18). Document LDM-135 no longer contains the historical background and the content has been migrated to different locations. The later sections are tests from LSST Trac that were referenced from the first two sections of LDM-135.

1 Design Trade-Offs¶

The LSST database design involves many architectural choices. Example of architectural decisions we faced include how to partition the tables, how many levels of partitioning is needed, where to use an index, how to normalize the tables, or how to support joins of the largest tables. This chapter covers the test we run to determine the optimal architecture of MySQL-based system.

1.1 Standalone Tests¶

1.1.1 Spatial join performance¶

This test was run to determine how quickly we can do a spatial self-join (find objects within certain spatial distance of other objects) inside a single table. Ultimately, in our architecture, a single table represents a single partition (or sup-partition). The test involved trying various options and optimizations such as using different indexes (clustered and non clustered), precalculating various values (like COS(RADIANS(decl))), and reordering predicates. We run these tests for all reasonable table sizes (using MySQL and PostgreSQL). We measured CPU and disk I/O to estimate impact on hardware. In addition, we re-run these tests on the lsst10 machine at NCSA to understand what performance we may expect there for DC3b [11]. These tests are documented below. We found that PostgreSQL was 3.7x slower for spatial joins over a range of row counts, and reducing the row-count per partition to less than 5k rows was crucial in achieving lowering compute intensity, but that predicate selectivity could compensate for a 2-4x greater row count.

1.1.2 Building sub-partitions¶

Based on the “spatial join performance” test we determined that in order to speed up self-joins within individual tables (partitions), these partitions need to be very small, \(O(\mathrm{few~} K)\) rows. However, if we partition large tables into a very large number of small tables, this will result in unmanageable number of tables (files). So, we determined we need a second level of partitioning, which we call sub-partition on the fly. This test included:

sub-partitioning through queries:
1. one query to generate one sub-partition
2. relying on specially introduced column (subPartitionId).
segregating data into sub-partitions in a client C++ program, including using a binary protocol.

We timed these tests. This test is described below. These tests showed that it was fastest to build the on-the-fly sub-partitions using SQL in the engine, rather than performing the task externally and loading the sub-partitions back into the engine.

1.1.3 Sub-partition overhead¶

We also run detailed tests to determine overhead of introducing sub-partitions. For this test we used a 1 million row table, measured cost of a full table scan of such table, and compared it against scanning through a corresponding data set partitioned into sub-partitioned. The tests involved comparing in-memory with disk-based tables. We also tested the influence of introducing “skinny” tables, as well as running sub-partitioning in a client C++ program, and inside a stored procedure. These tests are described at below. The on-the-fly overhead was measured to be 18% for select * queries, but 3600% if only one column (the skinniest selection) was needed.

1.1.4 Avoiding materializing sub-partitions¶

We tried to run near neighbor query on a 1 million row table. A starting point is 1000 sec which is ~16 min 40 sec (based on earlier tests we determined it takes 1 sec to do near neighbor for 1K row table).

The testing included:

Running near neighbor query by selecting rows with given subChunkId into in memory table and running near neighbor query there. It took 7 min 43 sec.
Running near neighbor query by running neighbor once for each subChunkId, without building sub-chunks. It took 39 min 29 sec.
Running near neighbor query by mini-near neighbor once for each subChunkId, without building sub-chunks, using in-memory table. It took 13 min 13 sec.

1.1.5 Billion row table / reference catalog¶

One of the catalogs we will need to support is the reference catalog, even in DC3b it is expected to contain about one billion rows. We have run tests with a table containing 1 billion rows catalog (containing USNO-B data) to determine how feasible it is to manage a billion row table without partitioning it. These tests are described in detail below. This revealed that single 1-billion row table usage is adequate in loading and indexing, but query performance was only acceptable when the query predicates selectivity using an index was a small absolute number of rows (1% selectivity is too loose). Thus a large fraction of index-scans were unacceptably slow and the table join speed was also slow.

1.1.6 Compression¶

We have done extensive tests to determine whether it is cost effective to compress LSST databases. This included measuring how different data types and indexes compress, and performance of compressing and decompressing data. These tests are described in details below. We found that table data compressed only 50%, but since indexes were not compressed, there was only about 20% space savings. Table scans are significantly slower due to CPU expense, but short, indexed queries were only impacted 40-50%.

1.1.7 Full table scan performance¶

To determine performance of full table scan, we measured:

raw disk speed with dd if=<large file> of=/dev/zero and got 54.7 MB/sec (2,048,000,000 bytes read in 35.71 sec)
speed of select count(*) from XX where muRA = 4.3 using a 1 billion row table. There was no index on muRA, so this forced a full table scan. Note that we did not do SELECT * to avoid measuring speed of converting attributes. The scan of 72,117,127,716 bytes took 28:49.82 sec, which is 39.8 MB/sec.

So, based on this test the full table scan can be done at 73% of the raw disk speed (using MySQL MyISAM).

1.1.8 Low-volume queries¶

A typical low-volume queries to the best of our knowledge can be divided into two types:

analysis of a single object. This typically involves locating a small number of objects (typically just one) with given objectIds, for example find object with given id, select attributes of a given galaxy, extract time series for a given star, or select variable objects near known galaxy. Corresponding representative queries:
```
SELECT * from Object where objectId=<xx>
SELECT * from Source where objectId =<xx>
```
analysis of objects meeting certain criteria in a small spatial region. This can be represented by a query that selects objects in a given small ra/dec bounding box, so e.g.:
```
SELECT * FROM Object
WHERE ra BETWEEN :raMin AND :raMax
AND decl BETWEEN :declMin AND :declMax
AND zMag BETWEEN :zMin AND :zMax
```

Each such query will typically touch one or a few partitions (few if the needed area is near partition edge). In this test we measured speed for a single partition.

Proposed partitioning scheme will involve partitioning each large table into a “reasonable” number of partitions, typically measured in low tens of thousands. Details analysis are done in the storage spreadsheet [7]. Should we need to, we can partition the largest tables into larger number of smaller partitions, which would reduce partition size. Given the hardware available and our time constraints, so far we have run tests with up to 10 million row partition size.

We determined that if we use our custom spatial index (“subChunkId”), we can extract 10K rows out of a 10 million row table in 30 sec. This is too long – low volume queries require under 10 sec response time. However, if we re-sort the table based on our spatial index, that same query will finish in under 0.33 sec.

We expect to have 50 low volume queries running at any given time. Based on details disk I/O estimates, we expect to have ~200 disk spindles available in DR1, many more later. Thus, it is likely majority of low volume queries will end up having a dedicated disk spindle, and for these that will end up sharing the same disk, caching will likely help.

Note that these tests were done on fairly old hardware (7 year old).

In summary, we demonstrated low-volume queries can be answered through an index (objectId or spatial) in well under 10 sec.

1.1.9 Solid state disks¶

We also run a series of tests with solid state disks to determine where it would be most cost-efficient to use solid state disks. The tests are described in details in [8]. We found that concurrent query execution is dominated by software inefficiencies when solid-state devices (SSDs) with fast random I/O are substituted for slow disks. Because the cost per byte is higher for SSDs, spinning disks are cheaper for bulk storage, as long as access is mostly sequential (which can be facilitated with shared scanning). However, because the cost per random I/O is much lower for SSDs than for spinning disks, using SSDs for serving indexes, exposure metadata, perhaps even the entire Object catalog, as well as perhaps for temporary storage is advised. This is true for the price/performance points of today’s SSDs. Yet even with high IOPS performance from SSDs, table-scan based selection is often faster than index-based selection: a table-scan is faster than an index scan when >9% of rows are selected (cutoff is >1% for spinning disk). The commonly used 30% cutoff does not apply for large tables for present storage technology.

1.2 Data Challenge Related Tests¶

During each data challenge we test some aspects of database performance and/or scalability. In DC1 [13] we demonstrated ingest into database at the level of 10% of DR1, in DC2 [1] we demonstrated near-real-time object association, DC3 [11] is demonstrating catalog construction and DC4 will demonstrate the end user query/L3 data production.

In addition to DC-related tests, we are running standalone tests, described in detail in [3], [6] and [12].

1.2.1 DC1: data ingest¶

We ran detailed tests to determine data ingest performance. The test included comparing ingest speed of MySQL against SQL Server speed, and testing different ways of inserting data to MySQL, including direct ingest through INSERT INTO query, loading data from ASCII CSV files. In both cases we tried different storage engines, including MyISAM and InnoDB. Through these tests we determined the overhead introduced by MySQL is small (acceptable). Building indexes for large tables is slow, and requires making a full copy of the involved table. These tests are described in details in Docushare Document-1386. We found that as long as indexes are disabled during loading, ingest speed is typically CPU bound due to data conversion from ASCII to binary format. We also found that ingest into InnoDB is usually ~3x slower than into MyISAM, independently of table size.

1.2.2 DC2: source/object association¶

One of the requirements is to associated DiaSource with Object is almost real-time. Detailed study how to achieve that has been done in conjunction with the Data Challenge 2. The details are covered below and the pages linked from there. We determined that we need to maintain a narrow subset of the data, and fetch it from disk to memory right before the time-critical association in order to minimize database-related delays.

1.2.3 DC3: catalog construction¶

In DC3 we demonstrated catalog creation as part of the Data Release Production.

1.2.4 Winter-2013 Data Challenge: querying database for forced photometry¶

Prior to running Winter-2013 Data Challenge, we tested performance of MySQL to determine whether the database will be able to keep up with forced photometry production which runs in parallel. We determined that a single MySQL server is able to easily handle 100-200 simultaneous requests in well under a second. As a result we chose to rely on MySQL to supply input data for forced photometry production. Running the production showed it was the right decision, e.g., the database performance did not cause any problems. The test is documented below.

1.2.5 Winter-2013 Data Challenge: partitioning 2.6 TB table for Qserv¶

The official Winter-2013 production database, as all past data challenged did not rely on Qserv, instead, plain MySQL was used instead. However, as an exercise we partitioned and loaded this data set into Qserv. This data set relies on table views, so extending the administrative tools and adding support for views inside Qserv was necessary. In the process, administrative tools were improved to flexibly use arbitrary number of batch machines for partitioning and loading the data. Further, we added support for partitioning RefMatch* tables; RefMatch objects and sources have to be partitioned in a unique way to ensure they join properly with the corresponding Object and Source tables.

1.2.6 Winter-2013 Data Challenge: multi-billion-row table¶

The early Winter 2013 production resulted in 2.6 TB database; the largest table, ForcedSource, had nearly 4 billion rows.[*] Dealing with multi-billion row table is non-trivial are requires special handling and optimizations. Some operations, such as building an index tend to take a long time (tens of hours), and a single ill-tuned variable can result in 10x (or worse) performance degradation. Producing the final data set in several batches was in particular challenging, as we had to rebuild indexes after inserting data from each batch. Key lessons learned have been documented at https://dev.lsstcorp.org/trac/wiki/mysqlLargeTables. Issues we uncovered with MySQL (myisamchk) had been reported to the MySQL developers, and were fixed immediately.

In addition, some of the more complex queries, in particular these with spatial constraints had to be optimized.[†] The query optimizations have been documented at https://dev.lsstcorp.org/trac/wiki/db/MySQL/Optimizations.

2 Other Demonstrations¶

2.1 Shared Scans¶

We have conducted preliminary empirical evaluation of our basic shared scan implementation. The software worked exactly as expected, and we have not discovered any unforeseen challenges. For the tests we used a mix of queries with a variety of filters, different CPU load, different result sizes, some with grouping, some with aggregations, some with complex math. Specifically, we have measured the following:

A single full table scan through the Object table took ~3 minutes. Running a mix 30 such queries using our shared scan code took 5 min 27 sec (instead of expected ~1.5 hour it’d take if we didn’t use the shared scan code.)
A single full table scan through Source table took between ~14 min 26 sec and 14 min 36 sec depending on query complexity. Running a mix of 30 such queries using shares scan code took 25 min 30 sec. (instead of over 7 hours).

In both cases the extra time it took comparing to the timing of a single query was related to (expected) CPU contention: we have run 30 simultaneous queries on a slow, 4-core machine.

In addition, we demonstrated running simultaneously a shared scan plus short, interactive queries. The interactive queries completed as expected, in some cases with a small (1-2 sec) delay.

2.2 Fault Tolerance¶

To prove Qserv can gracefully handle faults, we artificially triggered different error conditions, such as corrupting random parts of a internal MySQL files while Qserv is reading them, or corrupting data sent between various components of the Qserv (e.g., from the XRootD to the master process).

2.2.1 Worker failure¶

These tests are meant to simulate worker failure in general, including spontaneous termination of a worker process and/or inability to communicate with a worker node.

When a relevant worker (i.e. one managing relevant data) has failed prior to query execution, either 1) duplicate data exists on another worker node, in which case XRootD silently routes requests from the master to this other node, or 2) the data is unavailable elsewhere, in which case XRootD returns an error code in response to the master’s request to open for write. The former scenario has been successfully demonstrated during multi-node cluster tests. In the latter scenario, Qserv gracefully terminates the query and returns an error to the user. The error handling of the latter scenario involves recently developed logic and has been successfully demonstrated on a single-node, multi-worker process setup.

Worker failure during query execution can, in principle, have several manifestations.

If XRootD returns an error to the Qserv master in response to a request to open for write, Qserv will repeat request for open a fixed number (e.g. 5) of times. This has been demonstrated.
If XRootD returns an error to the Qserv master in response to a write, Qserv immediately terminates the query gracefully and returns an error to the user. This has been demonstrated. Note that this may be considered acceptable behavior (as opposed to attempting to recover from the error) since it is an unlikely failure-mode.
If XRootD returns an error to the Qserv master in response to a request to open for read, Qserv will attempt to recover by re-initializing the associated chunk query in preparation for a subsequent write. This is considered the most likely manifestation of worker failure and has been successfully demonstrated on a single-node, multi-worker process setup.
If XRootD returns an error to the Qserv master in response to a read, Qserv immediately terminates the query gracefully and returns an error to the user. This has been demonstrated. Note that this may be considered acceptable behavior (as opposed to attempting to recover from the error) since it is an unlikely failure-mode.

2.2.2 Data corruption¶

These tests are meant to simulate data corruption that might occur on disk, during disk I/O, or during communication over the network. We simulate these scenarios in one of two ways. 1) Truncate data read via XRootD by the Qserv master to an arbitrary length. 2) Randomly choose a single byte within a data stream read via XRootD and change it to a random value. The first test necessarily triggers an exception within Qserv. Qserv responds by gracefully terminating the query and returning an error message to the user indicating the point of failure (e.g. failed while merging query results). The second test intermittently triggers an exception depending on which portion of the query result is corrupted. This is to be expected since Qserv verifies the format but not the content of query results. Importantly, for all tests, regardless of which portion of the query result was corrupted, the error was isolated to the present query and Qserv remained stable.

2.2.3 Future tests¶

Much of the Qserv-specific fault tolerance logic was recently developed and requires additional testing. In particular, all worker failure simulations described above must be replicated within a multi-cluster setup.

2.3 Multiple Qserv Installations on a Single Machine¶

Once in operations, it will be important to allow multiple qserv instances to coexist on a single machine. This will be necessary when deploying new Data Release, or for testing new version of the software (e.g., MySQL, or Qserv). In the short term, it is useful for shared code development and testing on a limited number of development machines we have access to. We have successfully demonstrated Qserv have no architectural issues or hardcoded values such as ports or paths that would prevent us from running multiple instances on a single machine.

3 Spatial Join Performance¶

This section describes performance of spatial join queries as of early 2010.[‡]

In practice, we expect spatial joins on Object table only. To avoid the join on multi-billion row table, the table will be partitioned (chunked), as described in [5]. The queries discussed here correspond to a query executed on a single chunk; e.g., table partitioning and distribution of partitions across disks or nodes are not considered here. It is expected many such chunk-based queries will execute in parallel.

Used schema:

CREATE TABLE X (
  objectId  BIGINT NOT NULL PRIMARY KEY,
  ra        FLOAT NOT NULL,
  decl      FLOAT NOT NULL,
  muRA      FLOAT NOT NULL,
  muRAErr   FLOAT NOT NULL,
  muDecl    FLOAT NOT NULL,
  muDeclErr FLOAT NOT NULL,
  epoch     FLOAT NOT NULL,
  bMag      FLOAT NOT NULL,
  bFlag     FLOAT NOT NULL,
  rMag      FLOAT NOT NULL,
  rFlag     FLOAT NOT NULL,
  b2Mag     FLOAT NOT NULL,
  b2Flag    FLOAT NOT NULL,
  r2Mag     FLOAT NOT NULL,
  r2Flag    FLOAT NOT NULL
)

Used data: USNO catalog.

A first version of the query:

SELECT count(*)
FROM   X o1, X o2
WHERE  o1.objectId <> o2.objectId
  AND  ABS(o1.ra - o2.ra) < 0.00083 / COS(RADIANS(o2.decl))
  AND ABS(o1.decl - o2.decl) < 0.00083

Precalculating COS(RADIANS(decl)):

ALTER TABLE X ADD COLUMN cosRadDecl FLOAT
UPDATE X set cosRadDecl = COS(RADIANS(decl))

and changing the order of predicates as follows:

SELECT count(*)
FROM   X o1, X o2
WHERE  ABS(o1.ra - o2.ra) < 0.00083 / o2.cosRadDecl
  AND  ABS(o1.decl - o2.decl) < 0.00083
  AND  o1.objectId <> o2.objectId

improves the execution time by 36% for mysql, and 38% for postgres.

Here is the timing for this (optimized) query.

nRows	mysql	postgres
[K]	[sec]	[sec]
1	1	5
2	5	19
3	11	40
4	18	69
5	28	103
10	101	371
15	215	797
20	368
25	566

Postgres is ~3.7x slower than mysql in this case. It is probably possible to tune postgreSQL a little, but is it unlikely it will match MySQL performance.

Each of these queries for both mysql and postgres were completely CPU-dominated. The test was executed on an old-ish Sun 1503 MHz sparcv9 processor.

Also, see: [8] - the test rerun on a fast machine (dash-io).

3.1 Using index¶

The near neighbor query can be further optimized by introducing an index. Based on the tests we run with mysql, in order to force mysql to use an index for this particular query, we have to build a composite index on all used columns, or build a composite index on (ra, decl, cosRadDecl) and remove o1.objectId <> o2.objectId predicate (this predicate would have to be applied in a separate step). The timing for the query with index and without the objectId comparison:

nRows	was	now	faster
[K]	[sec]	[sec]	[%]
5	28	16.7	40
10	101	67.2	33
15	215	150.8	30
20	368	269.5	27
25	566	420.3	26

The speedup from using an index will likely be bigger for wider tables.

3.2 CPU utilization¶

How many CPUs do we need to do full correlation on 1 billion row table (DC3b)?

# chunks	rows/chunk	seconds per	total	total core-hours
		self-join	core-hours	if 16 cores used
		in 1 chunk	needed	twice faster
40,000	25,000	566	6,289	196
50,000	20,000	368	5,111	160
66,666	15,000	215	3,981	124
100,000	10,000	101	2,806	88
200,000	5,000	28	1,556	49
250,000	4,000	18	1,250	39
333,333	3,000	11	1,019	31
500,000	2,000	5	694	22
1,000,000	1,000	1	278

Realistically, we can count on ~2 8-core servers, ~twice faster than the CPUs used in these tests. That means 1 million chunk version would finish in 9 hours, 66K chunk version would need 5 days to finish.

3.3 Near neighbor without building sub-chunking¶

We tested the performance of running nn query without explicitly building sub-chunking. In all these tests describe here we tried to run nn on a 1 million row table. A starting point is 1000 sec (1 sec per 1K rows), which is ~16 min 40 sec.

First test: running near neighbor query by selecting rows with given subChunkId into in memory table and running nn query there. This test is here. It took 7 min 43 sec.

Second test: running near neighbor query by running neighbor once for each subChunkId, without building sub-chunks. This test is here. It took 39 min29 sec.

Third test: runnig near neighbor query by mini-near neighbor once for each subChunkId, without building sub-chunks, using in-memory table. This test is here. It took 13 min 13 sec.

3.4 Near neighbor with predicates¶

Note that full n² correlation without any cuts is the worst possible spatial join, rarely needed in practice. A most useful form of near neighbour search is a correlation with some predicates. Here is an example hypothetical (non-scientific) query, written for the data set used in our tests:

SELECT count(*)
FROM   X o1, X o2
WHERE  o1.bMag BETWEEN 20 AND 20.13
  AND  o2.rMag BETWEEN 19.97 AND 20.15
  AND  ABS(o1.ra - o2.ra) < 0.00083 / o2.cosRadDecl
  AND  ABS(o1.decl - o2.decl) < 0.00083
  AND  o1.objectId <> o2.objectId

For the data used, the applied o1.bMag cut selects ~2% of the table, so does the o2.rMag cut (if applied independently).

With these cuts, the near neighbour query on 25K-row table takes 0.3 sec in mysql and 1 sec in postgres (it’d probably run even faster with indexes on bMag and rMag). So if we used mysql we would need only 3 (1503 MHz) CPUs do run query over 1 billion rows in one hour.

For selectivity 20% it takes mysql 12 sec to finish and postgres needs 35 sec. In this case mysql would need 133 (1503 MHz) CPUs to run this query over 1 billion rows in one hour.

This clearly shows predicate selectivity is one of the most important factors determining how slow/fast the spatial queries will run. In practice, if the selectivity is <10%, chunk size = ~25K or 50K rows should work well.

Note

Perhaps we need to do more detailed study regarding predicate selectivity.

3.5 Numbers for lsst10¶

As of late 2009 the lsst10 server had 8 cores.

Elapsed time for a single job is comparable to elapsed time of 8 jobs run in parallel (difference is within 2-3%, except for very small partitions, where it reaches 10%).

Testing involved running 8 “jobs”, where each job was a set of queries executed sequentially. Each query was a near neighbor query:

SELECT count(*) AS neighbors
FROM   XX o1 FORCE INDEX (idxRDC),
       XX o2 FORCE INDEX (idxRDC)
WHERE  ABS(o1.decl - o2.decl) < 0.00083
  AND  ABS(o1.ra - o2.ra) < 0.00083 / o2.cosRadDecl

-- idxRDC was defined as "ADD INDEX idxRDC(ra, decl, cosRadDecl)"

Each query run on a single partition. Results:

rows per partition	seconds to self-join one partition	rows processed per elapsed sec	time to process 150m rows (DC3b)
0.5K	0.05	80K	31min
1K	0.16	50K	50min
2K	0.59	27K	1h 32min
5K	3.63	11K	3h 47min
10K	14.68	5.5K	7h 39min
15K	40.09	3K	14h

See 200909nearNeigh-lsst10.xls for details.

4 Building sub-partitions¶

We tested cost of building sub partitions on the fly.[§] This task involves taking a large partition and segregating rows into different tables, such that all rows with a given subChunkId end up in the same table.

4.1 Numbers for DC3b¶

In DC3b we will have ~150 million rows in the Object table.

A reasonable partitioning/sub-partitioning scheme:

1,500 partitions, 100K rows per partition.
a partition dynamically split into 100 sub-partitions 1K row each.

4.1.1 Justification¶

we want to keep the number of partitions <30K. Each partition = table = 3 files. 200K partitions would = 2GB of internal structure to be managed by xrootd (in memory).
we want relatively small sub-partitions (<2K rows) in order for near neighbor query to run fast, see the analysis above.
sub-partitions can’t be too small because the overlap will start playing big role, eg overlap over ~20% starts to become unacceptable.

Object density will not vary too much, and will be ~ 5 x 10⁵ / deg², or 140 objects / sqArcmin. (star:galaxy ratio will vary, but once we add both together it wont because the regions that have very high star densities also have very high extinction, so that background galaxies are very difficult to detect)

So a 1K-row subpartition will roughly cover 3x3 sq arcmin. Given SDSS precalculated neighbors for 0.5 arcmin distance, this looks reasonable (eg, the overlap is ~ 17% for the largest distance searched.)

4.2 Testing¶

So, based on the above, we tested performance of splitting a single 100k row table into a 100 1k tables.

4.2.1 Test 1: `CREATE TABLE SELECT FROM WHERE` approach¶

The simplest approach is to run in the loop

CREATE TABLE sp_xxx ENGINE=MEMORY SELECT * FROM XX100k where subChunkId=xxx;

for each sub chunk, where xxx is a subChunkId.

Timing: ~ 1.5 sec

4.2.2 Test 2: Segregating in a client program (ascii)¶

Second, we created a C++ program that does the following:

SELECT * from XX100k (reads the whole 100k row partition into memory)
segregate in memory rows based on subChunkId value
buffer them (a buffer per subChunkId)
flush each buffer to a table using INSERT INTO sp_xxx VALUES (1,2,3), (4,3,4), ..., (3,4,5)

Note that the values are sent from mysqld to the client as ascii, and sent back from the client back to the server as ascii too.

The client job was run on the same machine where the mysqld run.

Timing: ~3x longer than the first test. Only ~2-3% of the total elapsed time was spent in the client code, the rest was waiting for mysqld, so optimizing the client code won’t help.

In summary, it is worse than the first test.

4.2.3 Test 3: Segregating in a client (binary protocol)¶

We run a test similar to the previous one, but to avoid the overhead of converting rows to ascii we used the binary protocol. In practice,

reading from the input table was done through a prepared statement
writing to a subChunk tables was done through prepared statements INSERT INTO x VALUES (?,?,?), (?,?,?), ..., (?,?,?), and values were appropriately bound.

Note that a prep statement had to be created for each output table (table names can’t parameterized).

Relevant code is here.

Timing: ~2x longer than the first test. Similarly to the previous test, only 2-3% of time was spent in the client code.

Using CLIENT_COMPRESS flag makes things ~2x worse.

In summary, it is better then the previous test, but still worse than the first test.

4.3 Note on concurrency¶

So the test 1 is the fastest. It creates ~66 tables/second. Unfortunately, running 8 of these jobs in parallel (1 per core) on lsst10 takes ~8-9 sec, not 1.5 as one would expect (presumably due to mysql metadata contention). Writing from each core to a different database does not help.

4.4 Timing for DC3b¶

Assuming we use the “test 1” approach, and assuming we can do 1 partition in 1.5 sec (no speedup from multi-cores), it would take 37.5 min to sub-partition the whole Object table.

Based on the analysis above, we can do near neighbor join on 1K row table in 0.16 sec, so we need 16 sec to handle 100 such tables (equal 1 100k row partition). If we run it in parallel on 7 cores while the 8th core is sub-partitioning “next” partition, a single 100k-row partition would (theoretically, we didn’t try yet!) be done in 2.3 sec. That is 57 min for entire DC3b Object catalog, e.g., sub-partitioning costs us 7 min (near neighbor on 8 cores would finish in 50 min).

Each near-neighbor job does only one read access to the mysql metadata, so the contention on mysql metadata should not be a problem.

5 Sub-partitioning overhead [¶]¶

5.1 Introduction¶

As discussed in [5], it might be handy to be able to split a large chunk of a table on the fly into smaller sub-zones and sub-chunks.

To determine the overheads associated with this, we run some tests on a 1 million row table which in this case represented a reasonably sized chunk of a multi-billion row table. The used table had significantly less columns than the real Object table (72 bytes per row vs expected 2.5K). The test was run directly on the database server. Preparing for the test involved

Created a 1-million row table
Precalculated 10 subZoneIds such that each zone had the same (or almost the same) number of rows
Precalculated 100 subChunkIds (10 per zone) such that each chunk had the same (or almost the same) number of rows
Created an index on subChunkId. The index selectivity is 1%.

5.2 Conclusions¶

For SELECT * type queries, the measured overhead was ~18%. It will likely be even lower for our wide tables (Object table has 300 columns)

For SELECT the measured overhead was ~3600% (x36). However

if we use shared scans, this cost will be amortized over multiple queries. Also, it is very unlikely all queries using a shared scan will select one column
If it is a non-shared-scan query: we can put into subchunk tables only these columns that the query needs plus the primary key.

So in practice, the overhead of building sub-partitions seems acceptable.

See below how we determined all this.

5.3 Test one: determining cost of full table scan¶

First, to determine the “baseline” (how long it takes to run various queries that involve a single full table scan) we run the following 4 queries

SELECT * FROM XX                                               -- A
INSERT INTO memR SELECT * FROM XX                              -- B
SELECT SUM(bMag+rMag+b2Mag+r2Mag) FROM XX                      -- C
INSERT INTO memRSum SELECT SUM(bMag+rMag+b2Mag+r2Mag) FROM XX  -- D

Queries A and C represent the cost of querying the data, query B represents the minimum cost to build sub-chunks, and query D helps to determine the cost of sending the results to the client.

memR and memRSum tables are both in-memory.

After running them, we rerun them again to check the effect of potential caching. The timing for the first set was: 82.27, 10.32, 1.09, 1.06. The timing for the second set was almost the same: 81.86, 10.48, 1.04, 1.06, so it looks like caching doesn’t distort the results.

The table + indexes used for these tests = ~100MB in size, and they easily fit into 16GB of RAM we had available. We did not clean up the OS cache between tests, so it is fair to assume we are skipping ~2 sec which would be used to fetch the data from disk (which can do 55MB/sec)

Observations: it is expensive to format the results (cost of query B minus cost of query D = ~9 sec). It is even more expensive to send the results to the client application (cost of query A minus cost of query B = ~71 sec)

5.4 Test two: determining cost of dealing with many small tables¶

We divided 1 million rows across 100 small tables (10K rows per table). It took 1.10 sec to run the SELECT SUM query sequentially for all 100 tables, so the overhead of dealing with 100 smaller tables instead of one bigger is negligible.

A similar test with 1000 small tables (1K rows each) took 1.98 sec to run. Here, the overhead was visibly bigger (close to x2), but it is still acceptable.

Observation: the overhead of reading from many smaller tables instead of 1 big table is acceptable.

5.5 Test three: using subChunkIds¶

The test included creating one in-memory table, then for each chunk:

INSERT INTO memT SELECT * FROM XX WHERE subChunkId = <id>
SELECT SUM(bMag+rMag+b2Mag+r2Mag) FROM memT
TRUNCATE memT

It took 35.80 sec, and comparing with the “baseline” numbers (queries B and C) it was about 24 sec slower. Conclusion: WHERE subChunkId introduced 24 sec delay. With no chunking, our SELECT query would complete in ~1 sec, so the overall overhead is ~x36

Re-clustering the data based on the subChunkId index by doing:

myisamchk <baseDir>/subChunk/XX.MYI --sort-record=3

has minimal effect on the total execution cost. The likely reason is that the table used for the test easily fits in memory and it is not fetched from disk.

Doing the same test but with SELECT * instead of SUM(bMag+rMag+b2Mag+r2Mag) took 98.15 sec. Since the baseline query A took ~83 sec, in this case the overhead was only ~15 sec (18%).

5.6 Test four: using “skinny” subChunkIds¶

The test included creating one in-memory table CREATE TEMPORARY TABLE memT (magSums FLOAT, objectId BIGINT), then for each chunk:

INSERT INTO memT SELECT bMag+rMag+b2Mag+r2Mag, objectId FROM XX WHERE subChunkId = <id>
SELECT SUM(magSums) FROM memT
TRUNCATE memT

Essentially instead of blindly copying all columns to subchunks, we are coping only the minimum that is needed (including objectId which will always be needed for self-join queries). In this case, the test too 11.39 sec which is over 3x improvement comparing to the previous test.

5.7 Test five: sub-partitioning in a client program¶

We tried building sub-partitions through a client program (python). The code was looping through rows returned from SELECT * FROM XX, and for each row it

parsed the row
checked subChunkId
appended it to appropriate INSERT INTO chunk_ VALUES (...), (...) command

then it run the ‘insert’ commands.

Doing SELECT * FROM <table> takes 81 seconds, which is aligned with our baseline numbers (query A). Then remaining processing took 533 sec. It could probably be optimized a little, but it is not worth it, it is clear this is not the way to go.

5.8 Test six: stored procedure¶

Finally, we tried sub-partitioning through a stored procedure. Actually, not the whole sub-partitioning, but just the scanning. The following stored procedure was used:

CREATE PROCEDURE subPart()
BEGIN

  DECLARE objId BIGINT;
  DECLARE ra, decl DOUBLE;
  DECLARE muRA, muRAERr, muDecl, muDeclErr, epoch, bMag, bFlag,
          rMag, rFlag, b2Mag, b2Flag, r2Mag, r2Flag FLOAT;
  DECLARE subZoneId, subChunkId SMALLINT;

  DECLARE c CURSOR FOR SELECT * FROM XX;

  OPEN c;

  CURSOR_LOOP: LOOP
    FETCH C INTO objId, ra, decl, muRA, muRAERr, muDecl, muDeclErr, epoch, bMag, bFlag,
                 rMag, rFlag, b2Mag, b2Flag, r2Mag, r2Flag, subZoneId, subChunkId;

  END LOOP;

  CLOSE c;
END
//

So, this procedure only scans the data, it doesn’t insert rows to the corresponding sub-tables. It is roughly equivalent to the query B from the baseline test. This test took 21 seconds, almost the same as the corresponding baseline query B. Conclusion: pushing this computation to the stored procedure doesn’t visibly improve the performance.

6 MyISAM Compression Performance¶

This section[#] evaluates the compression of MyISAM tables in MySQL in order to better understand the benefits and consequences of using compression for LSST databases.

6.1 Overview¶

MyISAM is the default table persistence engine for MySQL, and offers excellent read-only performance without transactional or concurrency support. Although its performance is well-studied and generally understood for web-app and business database demands, we wanted to quantify its performance differences with and without compression in order to understand how the use of compression would affect LSST data access performance.

MyISAM packing compresses column-at-a-time.

6.2 1 minute summary¶

Compression works on astro data. Almost half of the tested table is floating point, and the data itself compressed about 50%. However, indexes did not seem to compress, and if we assume that their space footprint is 1.5x the raw data, the overall space savings is only 20%. Query performance is impacted, though probably less than twice as slow for short queries (perhaps 40-50%). Long queries that need table scans seem to slow down significantly as far as CPU, but the I/O savings has not been measured. When indexes are used, performance is about the same, though the indexes themselves are space-heavy and not compressed.

6.3 Goals¶

Ideally, we would like to understand the following:

What will we gain in storage efficiency? Compression should reduce the data footprint. However, while its effectiveness is understood for text, image (photos and line-art), audio, and machine code, it has been studied relatively little for scientific data, particularly astronomy data. We expect much, if not most, of LSST data to be measured floating-point values, which are likely to be completely unique. Without repeating values or sequences in astro data, compression effectiveness is unclear.
What is the performance/speed impact? Specifically:
- How expensive is it to compress and uncompress? Although bulk compression and decompression performance is probably not that important, it must be reasonable. Supra-linear increases of time with respect to size are probably not acceptable.
- How will it affect query and processing performance? Although we expect the query performance to be limited by I/O problems like disk bandwidth, interface/bus bandwidth, and memory bandwidth, compression could drive the CPU usage high enough to become a new bottleneck. Still, in many cases, the reduced data traffic could improve performance more than the decompression cost.

6.4 Test Conditions¶

We loaded the database with source data from the USNO. It has the following schema:

RA (decimal)
DEC (decimal)
Proper motion RA / yr (milli-arcsec)
error in RA pm / yr (milli-arcsec)
Proper motion in DEC / yr (milli-arcsec)
error in DEC pm / yr (milli-arcsec)
epoch of observations in years with 1/10yr increments
B mag
flag
R mag
flag
B mag2
flag2
R mag2
flag2

More notes that were bundled with the data:

which is a truncated set, but we had it on disk at IGPP.  We can live with
out the other fields.

flags are a measure of how extended the detection was with 0 being
extremely extended and 11 being just like the stellar point spread
function.

We added an “id” auto-increment field as a primary key.

The corresponding table:

Field	Type	Null	Key	Default	Extra
id	bigint(20)	NO	PRI	NULL	auto_increment
ra	float	YES		NULL
decl	float	YES		NULL
pmra	int(11)	YES		NULL
pmraerr	int(11)	YES		NULL
pmdec	int(11)	YES		NULL
pmdecerr	int(11)	YES		NULL
epoch	float	YES		NULL
bmag	float	YES	MUL	NULL
bmagf	int(11)	YES		NULL
rmag	float	YES		NULL
rmagf	int(11)	YES		NULL
bmag2	float	YES		NULL
bmagf2	int(11)	YES		NULL
rmag2	float	YES		NULL
rmag2f	int(11)	YES		NULL

6.4.1 Relevance to real data¶

This table is narrow compared to the expected column count for LSST, and only 7 of 16 columns are floating point fields. It may be appropriate to drop/add some of the integer columns so that the fraction of floating point columns is the same as what we will be expecting in LSST. For reference, here are the type distributions of columns for the LSST Object table.

type	DC3a	DC3b (estim)
FLOAT	30	142
DOUBLE	14	42
other	15	21
total	59	205

DC3a [2] is about 3/4 floating-point, and the current DC3b [11] estimate is about 90% floating point.

Removing the all the non-float/double columns from the table leaves 7 float columns and 1 bigint column, giving ~88% floating point values. The resulting truncated table was tested as well, although its bit-entropy may not match real data better than the original int-heavy table.

6.4.2 Hardware configuration¶

We tested on lsst-dev03, which is a Sun Fire V240 with Solaris 10 and 16G of memory. The database was stored on a dedicated (for practical purposes) disk that was 32% full after loading. MySQL 5.0.27 was used.

6.4.3 Test Queries¶

q1: Retrieve one entire row

SELECT *
FROM  %s
WHERE id = 40000;

q2: Retrieve about 10% of rows

SELECT  *
FROM    %s
WHERE   bmag - rmag > 10;

q1b: Retrieve bmag from one row

SELECT bmag
FROM  %s
WHERE id = 40000;

q2b: Retrieve bmag from about 10% of rows

SELECT  bmag
FROM    %s
WHERE   bmag - rmag > 10;

6.4.4 Numbers¶

Bulk Performance for 100 million rows: Times in seconds

	w/o idx	w/ idx
myisampack	1418	1387
Pack-repair	374	3200
Unpack	412	2959
Unpack-repair	102	2689
Total pack	1792	4587

Truncated table (7/8 float)

	w/o idx	w/ idx
myisampack	806	822
Pack-repair	231	2567
Unpack		2414
Unpack-repair		2194
Total pack		3389
Total pack row/s	9.64e4	2.95e4
Total pack byte/s	3.57e6	1.72e6

6.4.4.1 Sizes:¶

	Table	Index	Table(trunc)	Index(trunc)
Uncompressed	7000M	1442M	3700M	2140M
Compressed	2001M	2142M	1703M	2140M

(in 1M = 10⁶)

Table size is reduced to 29% of original (49% if indexes are included). The truncated table compresses somewhat less: 46% of original (66% if indexes are included). Since the indexes do not compress (indeed, they may expand), they must be considered in any compression evaluation.

6.4.4.2 Query Performance:¶

Test 1:

	uncompressed		packed
	No idx	Id/bmag idx	No idx	Id/bmag idx
q1: sel row	63.8	0.15	237	0.11
q2: filter	304.9	303.8	327.2	307.5

Notes: the uncompressed q1 time is fishy. I couldn’t reproduce it.

Test 2:

	packed		Unpacked
	no idx	id/bmag idx	no idx	id/bmag idx
q1	270.16	0.07	192.54	0.05
q2	290.35	307.2	407.29	309.06
q1b	267.24	0.22	192.6	0.16
q2b	178.09	152.05	242.38	154.01

Additional runs:

Unpacked		Unpacked (w/flush)		Unpacked (flush+restart)
no idx	id/bmag idx	no idx	id/bmag idx	no idx	id/bmag idx
192.36	0.06	143.23	0.04	97.96	0.05
407.24	255.23	272.96	255.82	265.14	283.37
192.65	0.16	136.1	0.17	100.64	0.17
242.61	92.41	100.01	92.02	92.45	196

Relative times: (packed/unpacked time, test 2)

	No index	Index
q1	1.4	1.46
q2	0.71	0.99
q1b	1.39	1.41
q2b	0.73	0.99

6.5 Truncated performance¶

We re-evaluated query performance on the truncated table. To enhance reproducibility, we ran query sets three times under each condition, flushing tables before the repeated sets (flush, then 4 queries, thrice). OS buffers were not cleared, so the results should indicate fully disk-cached performance.

Unpacked:

	Id/bmag index			No index
	1	2	3	1	2	3
q1	0.05	0.04	0.04	56.31	38.53	38.53
q2	230.09	206.85	207.34	206.43	206.2	206.05
q1b	0.13	0.15	0.15	38.53	44.12	38.54
q2b	100.38	103.55	91.8	84.21	84.15	84.28

Packed:

	Id/bmag index			No index
	1	2	3	1	2	3
q1	0.09	0.05	0.05	178.87	135.3	135.22
q2	339.21	307.46	307.14	307.99	307.53	307.69
q1b	0.15	0.26	0.13	137.04	135.31	135.43
q2b	185.73	183.39	186.06	184.26	184.26	184.3

Avg (2nd/3rd runs) compression penalty:

	id/bmag idx	no idx
q1	0.05	0.72
q2	0.33	0.33
q1b	0.23	0.69
q2b	0.47	0.54

Error calculations:

	avg perf packed		avg perf unpacked		rms error packed		rms error unpacked
	id/bmag idx	no idx	id/bmag idx	no idx	id/bmag idx	no idx	id/bmag idx	no idx
q1	0.05	135.26	0.04	38.53	0	0.04	0	0
q2	307.3	307.61	207.1	206.13	0.16	0.08	0.25	0.07
q1b	0.2	135.37	0.15	41.33	0.07	0.06	0	2.79
q2b	184.73	184.28	97.68	84.21	1.34	0.02	5.88	0.07

6.6 Results discussion¶

Since re-running and timing the queries on the truncated tables with an eye towards consistency and reproducibility, the picture has become a bit different. This discussion will be limited to the truncated table results, where the second and third runs are largely consistent as they should be fully cached.

Indexes only helped for the queries that selected single rows. Wherever an index was available and exploited, compression did not have a significant impact. Note that q1b with indexes seems worse with compression, but the difference is within the rms error (34% for q1b packed, w/indexes) for its timings.

Since indexes are never compressed, this makes sense. If the query exploits an index, compression should not matter since the index is structurally the same regardless of compression.

For those queries which were not optimized by indexes, compressed performance was a lot worse. With indexes, the degradation was between 5% to 47% and without indexes, between 33% to 72%.

Since these are timings for the fully-cached (OS and MySQL) conditions, bandwidth savings, the only possible benefit of compression, is ignored, and we measure only the CPU impact.

Since we expect query performance to be disk-limited, and compression effectively reduces the table sizes by 50% or more (excluding indexes), we should test further and include I/O effects. In particular, the balance between CPU performance and disk performance is crucial. The test machine’s generous memory capacity (16GB) relative to the table size (7GB uncompressed) makes uncached testing difficult.

6.6.1 Measurement variability¶

Results seemed largely reproducible until additional steps were taken to control conditions (via “flush table” and server restarting). Oddly enough, flushing tables or restarting the server seemed to improve performance for non-indexed, non-compressed situations (compressed tables have not been retested).

So far, none of the tests have included flushing OS disk buffers, since we don’t know of a quick way in Solaris (and the machine has 16G of memory).

MySQL performance itself seems to be quite complex, and sometimes surprising (should “flush table” improve performance?). To get better than factor-of-two estimates, we need to control conditions more aggressively and retest.

6.7 Take-home messages (“conclusions”)¶

Compression and uncompression take reasonable amounts of time.
Rebuilding indexes is about expensive as the compression/uncompression
Compression could effectively double the amount of store-able data, should the data be similar in distribution and variability to the USNO data.
Despite the measurement variability, we can guess that table scans are impacted 40-60% in CPU (while I/O is cut about 50%), and short row-retrieval takes a similar hit (though less than 2x). Where indexes are exploited, the difference seems small (as expected).

7 Storing Reference Catalog[♠]¶

Assumptions (based on DataAccWG telecon discussions 2009 Oct 23, Oct 6, Oct 2):

data will come from several sources (USNO-B, simCat, maybe SDSS, maybe 2MASS [TBD])
some fields will go over the pole
we should keep reference catalogs for simCat and all-the-rest separate (almost nothing will match)
size of the master reference catalog (MRC) will be ~1 billion rows (USNO-B), augmented by some columns (maybe from SDSS, maybe 2MASS). For now, assuming 100 bytes/row, that is 100GB [need to come up with schema]
we will need to extract a subset of rows to create individual reference catalogs:
- astrometric reference catalog (simCat, non-simCat)
- photometric reference catalog (simCat, non-simCat)
- SDQA astrometric reference catalog (simCat, non-simCat). It will need only bright, isolate stars [open question: maybe we don’t need separate SDQA cat]
- SDQA photometric reference catalog (simCat, non-simCat) - this is beyond DC3b [see open question above]
each reference catalog will be ~ 1% of the entire MRC, so ~1GB
some objects will be shared by multiple reference catalogs
we do not need to worry about updating reference catalogs (e.g., the input source catalogs are frozen) - this is true for at least USNOB and SDSS.
typical access pattern “give me all objects for a given CCD” - this implies a cone search with a ccd radius in ccd center, then would need to clip the edges.

7.1 Building the Reference Catalogs¶

ObjectIds from different data sources will not match, and we will need to run association (a la Association Pipeline) to synchronize them (this is a heavy operation).

Given the assumptions above and the results of the tests below, the best option seems to be:

build a MRC (join data from all sources into one large table)
extract reference catalogs and store separately (they are small, ~1 GB each in DC3b)
could push the MRC to tapes if we need to recover some disk space
should we ever need to introduce a new version of a source catalog - we would create a new MRC.

SDQA Catalogs: - want bright stars: need to apply filter on magnitude - want isolated stars: need to identify object without near neighbors - see this page for more details

The former is easy. The latter will require some thinking. It is a one-time operation, and we are talking about few million rows (1% of 1 billion = 10 millions, cut on magnitude will further reduce it). Options:

- maybe use custom C++ code, generate on the fly overlapping declination zones and stream by zone - maybe use DIF-spatial-index-assisted join

[This needs investigation]

Spatial index: we are considering using DIF [14].

Todo

write up DIF capabilities
test it
also try InnoDB with cluster index on healpixId

7.2 Description of the Tests¶

The tests were done on lsst-dev01 (2 CPUs), using /u1/ file system which can do ~45 MB/sec sequential read (tested using dd if=/u1/bigFile of=/dev/zero)

Input data: USNO-B catalog, 1 billion rows (65GB in csv form). Schema:

CREATE TABLE XX (
  # objectId  BIGINT NOT NULL PRIMARY KEY, -- added later
  ra        REAL NOT NULL,
  decl      REAL NOT NULL,
  muRA      FLOAT NOT NULL,
  muRAErr   FLOAT NOT NULL,
  muDecl    FLOAT NOT NULL,
  muDeclErr FLOAT NOT NULL,
  epoch     FLOAT NOT NULL,
  bMag      FLOAT NOT NULL,
  bFlag     FLOAT NOT NULL,
  rMag      FLOAT NOT NULL,
  rFlag     FLOAT NOT NULL,
  b2Mag     FLOAT NOT NULL,
  b2Flag    FLOAT NOT NULL,
  r2Mag     FLOAT NOT NULL,
  r2Flag    FLOAT NOT NULL
);

7.3 Quick Summary¶

Speed of loading is reasonable (~1 h 20 min)
Speed of building indexes is acceptable (~5 h)
Selecting small number of rows via index is very fast
Speed of full table scan is at ~85% of raw disk speed (30 min)
Selecting large number of rows via index (full index scan) is unacceptably slow (8 h)
Speed of join is relatively slow: at 15% of raw disk speed (~5 h)

Details are given below.

7.3.1 Data ingest¶

Loading data done via

LOAD DATA INFILE 'x.csv' INTO TABLE XX FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'"

took 1h 19 min.

iostat showed steady 27 MB/sec write I/O, disk was 18% busy, CPU 90% busy.

The db file size (MYD): 60 GB

7.3.2 Building key¶

Command:

alter table add key(bMag)

CPU ~30% busy. Disk ~75% busy doing 41 MB/sec write and 21 MB/sec read

Took 5h 11min

MYI file size: 12 GB

7.3.3 Adding primary key¶

(This is not a typical operation - we won’t be doing it in LSST)

Command:

ALTER TABLE XX
ADD COLUMN objectId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

Took 15h 36min

MYD grew to 68 GB, MYI grew to 26 GB

Index for objectId takes 14 GB. (theoretically, 1 billion x 8 bytes is 8 GB, so 43% overhead)

7.3.4 Full table scan¶

Command:

-- note, there is no index on r2Mag
select count(*) from XX where r2Mag > 5.6

Takes 30 min

iostat shows between 36-42 MB/sec read I/O, which makes sense (68 GB MYD file in 30 min = 38 MB/sec). This is 84% of the best this disk can do.

7.3.5 Selecting through index¶

7.3.5.1 Small # rows¶

Selecting a small number of rows through index takes no time

7.3.5.2 Large # rows (full index scan)¶

-- there is an index on bMag
select count(*) from XX where  bMag > 4;

Took 7h 55 min

The result = 370 milion rows (37% of all rows)

iostat shows there a lot of small io (1MB/sec), which is consistent with elapsed time: IDX file size = 27.4 GB –> 1MB/sec. It looks like mysql is inefficiently walking through the index (tree), fetching pieces from disk in small chunks.

Note that a similar query but selecting small number of rows

-- this returns 0 rows
select count(*) from XX where  bMag > 40;

Takes no time.

As expected, disabling indexes and rerunning query takes 29 min 14 sec (full table scan)

7.3.6 Speed of join¶

We may want to join USNO-B objects with eg SDSS objects on the fly (assuming objectId are synchronized)...

-- create dummy sdss table
CREATE TABLE sdss (
  objectId  BIGINT NOT NULL PRIMARY KEY,
  sdss1     FLOAT NOT NULL,
  sdss2     FLOAT NOT NULL,
  sdss3     FLOAT NOT NULL,
  sdss4     FLOAT NOT NULL,
  sdss5     FLOAT NOT NULL,
  sdss6     FLOAT NOT NULL
);

-- in this case it is 25% of usno rows
INSERT INTO sdss
SELECT objectId, bMag, bFlag, rMag, rFlag,b2Mag, b2Flag
FROM XX
WHERE objectId % 4 = 0;

The loading took 1 h 21 min.

Doing the join:

CREATE TABLE refCat1
SELECT * FROM XX
JOIN sdss using (objectId);

Takes 3 h 54 min

It involved reading 68+8 GB and writing 23 GB. That gives speed 7 MB/sec, which is ~15% of the raw disk speed.

7.3.7 Full USNOB-1 Schema¶

In Informix format (easy to map to MySQL)

create table "informix".usno_b1
  (
    usno_b1 char(12) not null ,
    tycho2 char(12),
    ra decimal(9,6) not null ,
    dec decimal(8,6) not null ,
    e_ra smallint not null ,
    e_dec smallint not null ,
    epoch decimal(5,1) not null ,
    pm_ra integer not null ,
    pm_dec integer not null ,
    pm_prob smallint,
    e_pm_ra smallint not null ,
    e_pm_dec smallint not null ,
    e_fit_ra smallint not null ,
    e_fit_dec smallint not null ,
    ndet smallint not null ,
    flags char(3) not null ,
    b1_mag decimal(4,2),
    b1_cal smallint,
    b1_survey smallint,
    b1_field smallint,
    b1_class smallint,
    b1_xi decimal(4,2),
    b1_eta decimal(4,2),
    r1_mag decimal(4,2),
    r1_cal smallint,
    r1_survey smallint,
    r1_field smallint,
    r1_class smallint,
    r1_xi decimal(4,2),
    r1_eta decimal(4,2),
    b2_mag decimal(4,2),
    b2_cal smallint,
    b2_survey smallint,
    b2_field smallint,
    b2_class smallint,
    b2_xi decimal(4,2),
    b2_eta decimal(4,2),
    r2_mag decimal(4,2),
    r2_cal smallint,
    r2_survey smallint,
    r2_field smallint,
    r2_class smallint,
    r2_xi decimal(4,2),
    r2_eta decimal(4,2),
    i_mag decimal(4,2),
    i_cal smallint,
    i_survey smallint,
    i_field smallint,
    i_class smallint,
    i_xi decimal(4,2),
    i_eta decimal(4,2),
    x decimal(17,16) not null ,
    y decimal(17,16) not null ,
    z decimal(17,16) not null ,
    spt_ind integer not null ,
    cntr serial not null
  );

revoke all on "informix".usno_b1 from "public" as "informix";

create index "informix".usno_b1_cntr on "informix".usno_b1 (cntr)  using btree ;
create index "informix".usno_b1_dec on "informix".usno_b1 (dec)    using btree ;
create index "informix".usno_b1_spt_ind on "informix".usno_b1     (spt_ind) using btree ;

8 DC2 Database Partitioning Tests¶

These tests [♥] are aimed at determining how to partition the Object and DIASource tables to support efficient operation of the Association Pipeline (AP). The task of the AP is to take new DIA sources produced by the Detection Pipeline (DP), and compare them with everything LSST knows about the sky at that point. This comparison will be used to generate alerts that LSST and other observatories can use for followup observations, and is also used to bring LSSTs knowledge of the sky up to date.

The current AP design splits processing of a Field-of-View (FOV) into 3 phases. For context, here is a brief summary:

prepare phase : This phase of the AP is in charge of loading information about the sky that falls within (or is in close proximity to) a FOV into memory. We will know the location of a FOV at least 30 seconds in advance of actual observation, and this phase of the AP will start when this information becomes available. The Object, DIASource, and Alert tables contain the information we will actually be comparing new DIA sources against. Of these, Object is the largest, DIASource starts out small but becomes more and more significant towards the end of a release cycle, and Alert is relatively trivial in size.

compare-and-update phase : This phase takes new DIASources (produced by the DP) and performs a distance based match against the contents of Object and DIASource. The results of the match are then used to retrieve historical alerts for any matched objects. The results of all these matches and joins are sent out to compute nodes for processing - these compute nodes decide which objects must be changed (or possibly removed), which DIA sources correspond to previously unknown objects, and which of them are cause for sending out an alert. At this point, the AP enters its final phase.

post-processing : The responsibility of this phase is to make sure that changes to Object (inserts, updates, possibly deletes), DIASource (inserts only), and Alert (inserts only) are present on disk. There is some (TODO: what is this number) amount of time during which we are guaranteed not to revisit the same FOV.

Note that LSST has a requirement to send out alerts within 60 seconds of image capture (there is a stretch goal of 30 seconds). Of the 3 AP phases, only compare-and-update is in the critical timing path. The telescope will be taking data for 1 FOV every 37 seconds: 15 sec exposure, 2 sec readout, 15 sec exposure, 5 sec readout and slew.

This is illustrated by the following (greatly simplified) diagram:

In this diagram, processing of a FOV starts right after observation with the image processing pipeline (IPP) which is followed by the DP, and finally the AP. The yellow and red boxes together represent processing which must happen in the 60 second window. Please note the boxes are not drawn to scale - IPP and DP are likely to take up more of the 60 second window than the diagram suggests. Also note that interaction with the moving object pipeline (MOPS) is omitted, but that there is some planned interaction between it and the AP (notably when a DIA source is mis-classified as an object rather than a moving object).

The database tests are currently focused on how to partition Object and DIASource such that the prepare phase is as fast as possible, and on how to perform the distance based crossmatch of the compare-and-update phase. Tests of database updates, inserts, and of how quickly such changes can be moved from in-memory tables to disk based tables will follow at a later date.

The tests are currently being performed using the USNO-B catalog as a stand-in for the Object table. USNO-B contains 1045175763 objects, so is a bit less than half way to satisfying the DC2 requirement of simulating LSST operations at 10% scale for DR1 (23.09 billion objects).

8.1 Code¶

The code for the tests is available here. The following files are included:

`lsstpart/Makefile`	Builds `objgen`, `chunkgen`, and `bustcache`
`lsstpart/bustcache.cpp`	Small program that tries to flush OS file system cache
`lsstpart/chunkgen.cpp`	Generates chunk and stripe descriptions
`lsstpart/crossmatch.sql`	SQL implementation of distance based crossmatch using the zone algorithm [10]
`lsstpart/distribution.sql`	SQL commands to calculate the spatial distribution of the Object table (used to pick test regions and to generate fake Objects)
`lsstpart/genFatObjectFiles.py`	Generates table schemas for “fat” tables
`lsstpart/objgen.cpp`	Program for generating random (optionally perturbed) subsets of tables, as well as random positions according to a given spatial distribution
`lsstpart/prepare.bash`	Loads USNO-B data into Object table
`lsstpart/prepare_chunks.bash`	Creates coarse chunk tables from Object table
`lsstpart/prepare_diasource.bash`	Uses `objgen` to pick random subsets of Object in the test regions (with perturbed positions). These are used to populate a fake DIASource table
`lsstpart/prepare_fine_chunks.bash`	Loads fine chunk tables from Object
`lsstpart/prepare_stripes.bash`	Loads stripe tables (indexes and clusters on ra) for the test regions from Object
`lsstpart/prepare_zones.bash`	Takes stripe tables generated by `prepare_stripes.bash` and clusters them on (zoneId, ra)
`lsstpart/prng.cpp`	Pseudo random number generator implementation
`lsstpart/prng.h`	Pseudo random number generator header file
`lsstpart/schema.sql`	Test database schema
`lsstpart/stripe_vars.bash`	Variables used by stripe testing scripts
`lsstpart/test_chunks.bash`	Test script for the coarse chunking approach
`lsstpart/test_fine_chunks.bash`	Test script for the fine chunking approach
`lsstpart/test_funs.bash`	Common test functions
`lsstpart/test_regions.bash`	Ra/dec boundaries for the test FOVs
`lsstpart/test_stripes.bash`	Test script for the stripe approach with ra indexing
`lsstpart/test_zones.bash`	Test script for the stripe approach with (zoneId, ra) indexing

8.2 Partitioning Approaches¶

8.2.1 Stripes¶

In this approach, each table partition contains Objects having positions falling into a certain declination range. Since a FOV will usually only overlap a small ra range within a particular stripe, indexes are necessary to avoid a table scan of the stripe when reading data into memory.

There are two indexing strategies under consideration here - the first indexes each stripe on ra (and also clusters data on ra). The second indexes and clusters each stripe on (zoneId, ra). In both cases the height of a stripe is 1.75 degrees. The height of a zone is set to 1 arcminute.

8.2.2 Chunks¶

In this approach, a table partition corresponds to an ra/dec box - a chunk - on the sky. See chunkgen.cpp for details on how chunks are generated. Basically, the sky is subdivided into stripes of constant height (in declination), and each stripe is further split into chunks of constant width (in right ascension). The number of chunks per stripe is chosen such that the minimum distance between two points in non adjacent chunks of the same stripe never goes below some limit.

There are no indexes kept for the on-disk tables at all - not even a primary key - so each chunk table is scanned completely when read. Two chunk granularities are tested: the first partitions 1.75 degree stripes into chunks at least 1.75 degrees wide (meaning about 9 chunks must be read in per FOV), the second partitions 0.35 degree stripes into chunks at least 0.35 degrees wide (so about 120 chunks must be examined per FOV).

So with coarse chunks, each logical table (Object, DIASource) is split into 13218 physical tables. With fine chunks, 335482 physical tables are needed. Due to OS filesystem limitations, chunk tables will have to be distributed among multiple databases (this isn’t currently implemented).

8.3 Testing¶

8.3.1 Hardware¶

SunFire V240
2 UltraSPARC IIIi CPUs, 1503 MHz
16 GB RAM
2 Sun StoreEdge T3 arrays, 470GB each, configured in RAID 5, sustained sequential write speed (256KB blocks) 150 MB/sec, read: 146 MB/sec
OS: Sun Solaris sun4x_510
MySQL: version 5.0.27

8.3.2 General Notes¶

Each test is run with both “skinny” objects (USNO-B with some additions, ~100bytes per row), and “fat” objects (USNO-B plus 200 DOUBLE columns set to random values, ~1.7kB per row) that match the expected row size (including overhead) of the LSST Object table. Any particular test is always run twice in a row: this should shed some light on how OS caching of files affects the results. Between sets of tests that touch the same tables, the bustcache program is run in an attempt to flush the operating system caches (it does this by performing random 1MB reads from the USNO-B data until 16GB of data have been read).

DIA sources were generated using the objgen program to pick a random subset of Object in the test FOVs. Approximately 1 in 100 objects were picked for the subset, and each then had its position perturbed according to a normal distribution with sigma of 2.5e-4 degrees (just under 1 arcsecond).

8.3.3 Test descriptions¶

Read Object data from the test FOVs into in-memory table(s) with no indexes whatsoever.
Read Object data from the test FOVs into in-memory table(s) that have the required indexes created before data is loaded. These indexes are:
- a primary key on id (hash index) and
- a B-tree index on zoneId for fine chunks, or (for all others) a composite index on (zoneId, ra).
Read Object data from the test FOVs into in-memory table(s) with no indexes whatsoever, then create indexes (the same ones as above) after loading finishes.

Note that all tests except the fine chunking tests place objects into a single InMemoryObject table (and DIA sources into a single InMemoryDIASource table) which are then used to for cross-matching tests. The fine chunking tests place each on-disk table into a separate in-memory table. This complicates the crossmatch implementation somewhat, but allows for reading many chunk tables in parallel without contention on inserts to a single in-memory table. It allows crossmatch to be parallelized by having different clients call the matching routine for different sub-regions of the FOV (each client is handled by a single thread on the MySQL server).

The following variations on the basic crossmatch are tested (all on in-memory tables):

Use both match-orders: objects to DIA sources and DIA sources to objects
Test both slim and wide matching:
- a slim match stores results simply as pairs of keys by which objects and DIA sources can be looked up
- a wide match stores results as a key for a DIA source along with the values of all columns for the matching object. This is important for the fine chunk case, since looking up objects becomes painful when they can be in one of many in-memory tables.

Note that all crossmatches are using a zone-height of 1 arc-minute and a match-radius of 0.000833 degrees (~3 arcseconds).

8.4 Performance Results¶

Crossmatch performance is largely independent of the partitioning approach (there is slight variation since the various approaches don’t necessarily read the same number of objects into memory) except in the fine chunking case where some extra machinery comes into play.

8.4.1 Stripes with ra indexes: Performance Summary¶

For the high density FOV

3044468 objects are read into memory
? objects read from disk

For the low density FOV

76073 objects are read into memory
? objects read from disk

For the average density FOV

373763 objects are read into memory
? objects read from disk

8.4.1.1 Summary¶

Below, elapsed times for multiple runs of every test performed are listed.

Test	FOV	Row Size	Time - 1st run
Load test	High density	Skinny	1m1.231s
		Fat	5m4.386s
	Low density	Skinny	0m2.018s
		Fat	0m8.196s
	Avg. density	Skinny	0m8.067s
		Fat	0m38.553s

Load and index test	High density	Skinny	1m30.129s
		Fat	5m46.250s
	Low density	Skinny	0m2.511s
		Fat	0m8.971s
	Avg. density	Skinny	0m10.991s
		Fat	0m42.360s

Load then index test	High density	Skinny	1m50.968s
		Fat	14m10.113s
	Low density	Skinny	0m2.735s
		Fat	0m23.783s
	Avg. density	Skinny	0m12.012s
		Fat	1m18.091s

8.4.2 Stripes with (zoneId,ra) indexes: Performance Summary¶

For the high density FOV

3044282 objects are read into memory
- note: this is less than the other stripe test (which filters by ra and dec rather than ra and zoneId) due to an off by one error in the zone bounds I was using for the insert query. This slight error doesn’t warrant rerunning the test.
? objects read from disk

For the low density FOV

76274 objects are read into memory
? objects read from disk

For the average density FOV

373804 objects are read into memory
? objects read from disk

Currently the load test reads in the high density FOV with fat rows in 133 minutes (many times slower than all other approaches). The test is completely IO bound (disk is 80-98% busy, transfer rate is between about 800kB/s and 4MB/s with an average closer to 1MB/s, ~120 r/s sustained). Performance is seek limited - we could attack this problem by putting the indexes for each stripe on separate disks from the data, and furthermore by making sure adjacent stripes don’t share any disk. Due to the extremely poor performance, full test results are omitted. Note also that clustering the stripe tables involved in the test took almost a full week (the tables involved too large for in-memory sorting).

Todo

Setup and test scripts should be double checked for bugs.

8.4.3 Coarse Chunks: Performance¶

For the high density FOV

3044468 objects are read into memory
5865220 objects read from disk

For the low density FOV

76073 objects are read into memory
? objects read from disk

For the average density FOV

373763 objects are read into memory
? objects read from disk

8.4.3.1 Summary¶

Below, elapsed times for multiple runs of every test performed are listed.

Test	FOV	Row Size	Time - 1st run
Load test	High density	Skinny	0m36.023s
		Fat	4m52.873s
	Low density	Skinny	0m1.969s
		Fat	0m9.818s
	Avg. density	Skinny	0m5.517s
		Fat	0m39.668s

Load and index test	High density	Skinny	1m0.011s
		Fat	5m27.291s
	Low density	Skinny	0m2.488s
		Fat	0m10.513s
	Avg. density	Skinny	0m8.013s
		Fat	0m42.883s

Load then index test	High density	Skinny	1m14.078s
		Fat	13m56.050s
	Low density	Skinny	0m3.178s
		Fat	0m25.482s
	Avg. density	Skinny	0m9.924s
		Fat	1m7.057s

8.4.4 Fine Chunks: Performance¶

For the high density FOV

3596852 objects are read into memory
3596852 objects read from disk

For the low density FOV

105847 objects are read into memory
105847 objects read from disk

For the average density FOV

446289 objects are read into memory
446289 objects read from disk

8.4.4.1 Summary¶

Below, elapsed times for multiple runs of every test performed are listed.

Test	FOV	Row Size	Time - 1st run
Load test	High density	Skinny	0m42.217s
		Fat	4m50.362s
	Low density	Skinny	0m3.589s
		Fat	0m22.299s
	Avg. density	Skinny	0m16.066s
		Fat	0m50.490s

Load and index test	High density	Skinny	1m1.573s
		Fat	5m19.527s
	Low density	Skinny	0m4.206s
		Fat	0m24.154s
	Avg. density	Skinny	0m19.058s
		Fat	0m54.383s

Load then index test	High density	Skinny	1m31.118s
		Fat	8m8.764s
	Low density	Skinny	0m5.717s
		Fat	0m52.224s
	Avg. density	Skinny	0m21.408s
		Fat	1m30.905s

8.4.5 Crossmatch performance¶

Zone height is 1 arc-minute and match distance is 3 arcseconds for all tests.

A single in-memory table of objects matched against a single in-memory table of DIA sources (and vice versa) :

FOV	Row size	match-width	match direction	Time - 1st run	Time - 2nd run	Time - 3rd run
High density	Skinny	slim	objects vs. DIA sources	0m51.989s	0m52.081s	0m52.169s
			DIA sources vs. objects	0m35.692s	0m35.371s	0m35.442s
		wide	objects vs. DIA sources	0m52.489s	0m52.498s	0m52.521s
			DIA sources vs. objects	0m36.069s	0m36.180s	0m35.805s
	Fat	slim	objects vs. DIA sources	1m4.088s	1m4.085s	1m3.906s
			DIA sources vs. objects	0m41.543s	0m41.763s	0m41.480s
		wide	objects vs. DIA sources	1m19.124s	1m20.902s	1m11.180s
			DIA sources vs. objects	0m57.569s	1m0.377s	0m48.830s
Low density	Skinny	slim	objects vs. DIA sources	0m1.284s	0m1.278s	0m1.278s
			DIA sources vs. objects	0m0.947s	0m0.933s	0m0.945s
		wide	objects vs. DIA sources	0m1.307s	0m1.285s	0m1.283s
			DIA sources vs. objects	0m0.980s	0m0.961s	0m0.950s
	Fat	slim	objects vs. DIA sources	0m1.582s	0m1.596s	0m1.567s
			DIA sources vs. objects	0m1.109s	0m1.108s	0m1.104s
		wide	objects vs. DIA sources	0m1.995s	0m2.000s	0m1.883s
			DIA sources vs. objects	0m1.568s	0m1.580s	0m1.459s
Avg. density	Skinny	slim	objects vs. DIA sources	0m5.891s	0m5.887s	0m5.926s
			DIA sources vs. objects	0m4.085s	0m4.174s	0m4.148s
		wide	objects vs. DIA sources	0m5.991s	0m5.951s	0m5.939s
			DIA sources vs. objects	0m4.211s	0m4.199s	0m4.161s
	Fat	slim	objects vs. DIA sources	0m7.392s	0m7.662s	0m7.356s
			DIA sources vs. objects	0m4.831s	0m4.870s	0m4.871s
		wide	objects vs. DIA sources	0m9.118s	0m9.155s	0m8.587s
			DIA sources vs. objects	0m6.807s	0m6.811s	0m6.234s

For the fine chunking scheme, the match is processed one stripe at a time (for stripes overlapping the FOV) yielding 11 units of work that are currently executed in serial fashion, but could be run in parallel.

FOV	Row size	match-width	match direction	Time - 1st run	Time - 2nd run
High density	Skinny	slim	objects vs. DIA sources	1m20.287s	1m20.58s
			DIA sources vs. objects	0m56.065s	0m56.5s
		wide	objects vs. DIA sources	1m21.169s	1m21.151s
			DIA sources vs. objects	0m57.72s	0m59.873s
	Fat	slim	objects vs. DIA sources	2m9.347s	2m9.122s
			DIA sources vs. objects	1m18.949s	1m18.181s
		wide	objects vs. DIA sources	2m26.441s	2m26.048s
			DIA sources vs. objects	1m37.277s	1m38.656s
Low density	Skinny	slim	objects vs. DIA sources	0m10.592s	0m10.595s
			DIA sources vs. objects	0m8.498s	0m8.44s
		wide	objects vs. DIA sources	0m11.056s	0m11.019s
			DIA sources vs. objects	0m9.024s	0m9.228s
	Fat	slim	objects vs. DIA sources	0m12.064s	0m12.01s
			DIA sources vs. objects	0m9.141s	0m9.14s
		wide	objects vs. DIA sources	0m16.527s	0m16.656s
			DIA sources vs. objects	0m13.04s	0m13.156s
Avg. density	Skinny	slim	objects vs. DIA sources	0m15.764s	0m15.812s
			DIA sources vs. objects	0m13.407s	0m13.378s
		wide	objects vs. DIA sources	0m16.19s	0m16.135s
			DIA sources vs. objects	0m14.016s	0m14.033s
	Fat	slim	objects vs. DIA sources	0m21.451s	0m21.437s
			DIA sources vs. objects	0m16.337s	0m16.392s
		wide	objects vs. DIA sources	0m26.95s	0m26.814s
			DIA sources vs. objects	0m21.757s	0m21.92s

8.4.5.1 Effect of zone height on cross-match performance¶

Varying the zone height doesn’t have a huge effect on performance - smaller zone heights are slightly faster than larger ones, until one reaches 20 arcseconds. At a zone height of 15 arcseconds, performance suddenly drops by a factor of 12 to 16. One possible explanation for this would be that the in-memory zone tables involved in the match pass a size threshold below which table scans are used instead of b-tree range lookups.

8.4.5.1.1 2 arcminute zone height¶

FOV	Row size	match-width	match direction	Time
High density	Skinny	slim	objects vs. DIA sources	0m54s
			DIA sources vs. objects	0m36s
		wide	objects vs. DIA sources	0m55s
			DIA sources vs. objects	0m36s

8.4.5.1.2 45 arcsecond zone height¶

FOV	Row size	match-width	match direction	Time
High density	Skinny	slim	objects vs. DIA sources	0m51.203s
			DIA sources vs. objects	0m34.112s
		wide	objects vs. DIA sources	0m51.615s
			DIA sources vs. objects	0m34.511s

8.4.5.1.3 30 arcsecond zone height¶

FOV	Row size	match-width	match direction	Time
High density	Skinny	slim	objects vs. DIA sources	0m50.535s
			DIA sources vs. objects	0m33.615s
		wide	objects vs. DIA sources	0m50.989s
			DIA sources vs. objects	0m34.261s

8.4.5.1.4 20 arcsecond zone height¶

FOV	Row size	match-width	match direction	Time
High density	Skinny	slim	objects vs. DIA sources	0m54.990s
			DIA sources vs. objects	0m35.113s
		wide	objects vs. DIA sources	0m55.579s
			DIA sources vs. objects	0m35.657s

8.4.5.1.5 15 arcsecond zone height¶

FOV	Row size	match-width	match direction	Time
High density	Skinny	slim	objects vs. DIA sources	11m18.951s
			DIA sources vs. objects	8m25.763s
		wide	objects vs. DIA sources	11m20.142s
			DIA sources vs. objects	8m28.906s

9 Database Performance for Forced Photometry¶

Tested performance of database[♦] using data produced based on SDSS stripe 82 DR8. Database location: lsst10. Test involved running different number of concurrent queries and measuring response time. Each query was selecting sources for a randomly picked bounding box, and returned ~1200-1300 rows. Executed queries looked like:

SELECT scisql_s2CPolyToBin(ra0, decl0, ra1, decl1, ra2, decl2, ra3, decl3) INTO @poly
  FROM Field
  WHERE uniqueId = <uniqueId>;

CALL scisql.scisql_s2CPolyRegion(@poly, 20);

SELECT sourceId, ra, decl, htmId20
  FROM Source AS s
  INNER JOIN scisql.Region AS r ON (s.htmId20 BETWEEN r.htmMin AND r.htmMax)
  WHERE scisql_s2PtInCPoly(ra, decl, @poly) = 1;

First test involved running 10,000 queries total, all on a single node (lsst9) and varying the level of parallelism. During each test all queries were started at exactly the same time, and there was no time delay introduced between them. Results:

queries per thread	threads	slowest elapsed time [sec]	queries/sec
10,000	1	466.2	21.5
5,000	2	247.7	20.2
2,500	4	144.4	17.3
2,000	5	124.2	16.1
1,250	8	96.6	12.9
1,000	10	91.8	10.9
400	25	94.6	4.2
200	50	95.6	1.0
100	100	100.0	1.0

Based on the timer in the test program, the database response time was about the same for each test (~0.2 sec), the limiting factor was threading in the python test program that run all threads.

Second test involved querying the database from two different nodes (lsst9 and lsst5), 100 threads per node. Measured elapsed time was comparable to the single-node test:

queries per thread	threads	slowest elapsed time [sec]	queries/sec
100	50 x 2 machines	109.0	0.9

During this test, average query response time was up (approaching ~0.4 sec) - less time was spent in the python client, but the server response was slower. The limiting factor was server CPU. The server was completely cpu bound - all 24 cpus were 100% busy. show processlist showed 200 simultaneous queries processed by the server.

During all these tests there was almost no disk I/O activity. This is not unexpected, as the entire data set including indexes is < 8 GB in size, much smaller than RAM available on lsst10.

Related code: see git repo: contrib/forcedPhotDbDemo.git

9.1 About Data Ingest¶

Counts: ~200 million sources (~6GB + 2GB index), ~200K fields (~15 MB).

Loading the data took: ~35 min, sorting by htmId another 35 min.

9.2 Conclusions¶

The server server seems to be handling well 100-200 requests coming in simultaneously, and deliver results in well under a second, even under heavy load, provided there are no other activities competing for mysqld server time and memory. Ideally, it’d be useful to try querying from many more clients, but given in practice forced photometry code will crunch on the data for ~10 sec, thus there there will be ~10 sec delay between queries incoming from the same process, the load seen by the server should be much lower than that tested.

10 References¶

[1]

[Document-5356]. Tim Axelrod, Robyn Allsman, Jeff Kantor, Jon Myers, Francesco Pierfederici, Serge Monkewicz, Greg Daues, David Gehrig, Steve Pietrowicz, Raymond Plante, Jacek Becla, Kian-Tat Lim, Andrew Becker, Lynne Jones, Russell Owen, Nicole Silvestri, Robert H. Lupton, Jeff Bartels, Phil Pinto, Jeff Kantor, Michael Strauss, and Don Sweeney. LSST Data Challenge 2. 2008. URL, https://ls.st/Document-5356.

[2]	[LDM-17]. Tim Axelrod and others. Lsst data challenge 3a final report. 2009. URL, https://ls.st/LDM-17.

[3]	[DMTR-12]. Jacek Becla. Qserv 300 node test. 2013. URL, https://ls.st/DMTR-12.

[4]	[Document-11625]. Jacek Becla, K-T Lim, and Daniel Wang. Database Architecture. 2011. URL, https://ls.st/Document-11625.

[5]	(1, 2) [Document-26276]. Jacek Becla, K-T Lim, and Daniel Wang. Scalable partitioning. 2013. URL, https://ls.st/Document-26276.

[6]	[DMTR-21]. Jacek Becla, K-T Lim, and Daniel Wang. Early (pre-2013) Large-Scale Qserv Tests. 2013. URL, https://ls.st/DMTR-21.

[7]	[LDM-141]. Jacek Becla and K.-T. Lim. Data management storage sizing and i/o model. 2013. URL, https://ls.st/LDM-141.

[8]	(1, 2) [Document-11701]. Jacek Becla, Kian-Tat Lim, and Daniel Wang. Evaluation of solid state disks. 2011. URL, https://ls.st/Document-11701.

[9]	[LDM-135]. Jacek Becla, Daniel Wang, Serge Monkewitz, K-T Lim, Douglas Smith, and Bill Chickering. Database design. 2013. URL, https://ls.st/LDM-135.

[10]

Jim Gray. The zones algorithm for finding points-near-a-point or cross-matching spatial datasets. Technical Report MSR-TR-2006-52, Microsoft, April 2006. URL: https://www.microsoft.com/en-us/research/publication/the-zones-algorithm-for-finding-points-near-a-point-or-cross-matching-spatial-datasets/.

[11]	(1, 2, 3) [Document-9044]. Jeff Kantor, Tim Axelrod, Robyn Allsman, Mike Freemon, and K-T Lim. Data Challenge 3b Overview. 2010. URL, https://ls.st/Document-9044.

[12]	[LDM-552]. Fritz Mueller. Qserv software test specification. 2017. URL, https://ls.st/LDM-552.

[13]	[Document-9541]. Ray Plante, Robyn Allsman, Tim Axelrood, Jacek Becla, Cristina Beldica, Greg Daues, Jeff Kantor, Jonathan Myers, Russell Owen, and Steve Pietrowicz. Results from data challenge 1. 2010. URL, https://ls.st/Document-9541.

[14]

L. Nicastro and G. Calderone. Indexing Astronomical Database Tables using HTM and HEALPix. In R. W. Argyle, P. S. Bunclark, and J. R. Lewis, editors, Astronomical Data Analysis Software and Systems XVII, volume 394 of Astronomical Society of the Pacific Conference Series, 487. August 2008. arXiv:0711.4964.

[*]	It is worth noting that in real production we do not anticipate to manage billion+ rows in a single physical table - the Qserv system that we are developing will split every large table into smaller, manageable pieces.

[†]	Some of these optimizations will not be required when we use Qserv, as Qserv will apply them internally.

[‡]	Original location of this report: https://dev.lsstcorp.org/trac/wiki/db/SpatialJoinPerf

[§]	Original location of this 2009 report: https://dev.lsstcorp.org/trac/wiki/db/BuildSubPart

[¶]	Original location of this 2009 report: https://dev.lsstcorp.org/trac/wiki/db/SubPartOverhead

[#]	Original location of this 2009 report: https://dev.lsstcorp.org/trac/wiki/MyIsamCompression

[♠]	Original location of this 2009 report: https://dev.lsstcorp.org/trac/wiki/DbStoringRefCat

[♥]	Original location of this 2007 report: https://dev.lsstcorp.org/trac/wiki/db/DC2/PartitioningTests

[♦]	Original location of this 2012 report: https://dev.lsstcorp.org/trac/wiki/db/tests/ForcedPhoto

DMTN-048: Qserv design prototyping experiments

1 Design Trade-Offs¶

1.1 Standalone Tests¶

1.1.1 Spatial join performance¶

1.1.2 Building sub-partitions¶

1.1.3 Sub-partition overhead¶

1.1.4 Avoiding materializing sub-partitions¶

1.1.5 Billion row table / reference catalog¶

1.1.6 Compression¶

1.1.7 Full table scan performance¶

1.1.8 Low-volume queries¶

1.1.9 Solid state disks¶

1.2 Data Challenge Related Tests¶

1.2.1 DC1: data ingest¶

1.2.2 DC2: source/object association¶

1.2.3 DC3: catalog construction¶

1.2.4 Winter-2013 Data Challenge: querying database for forced photometry¶

1.2.5 Winter-2013 Data Challenge: partitioning 2.6 TB table for Qserv¶

1.2.6 Winter-2013 Data Challenge: multi-billion-row table¶

2 Other Demonstrations¶

2.1 Shared Scans¶

2.2 Fault Tolerance¶

2.2.1 Worker failure¶

2.2.2 Data corruption¶

2.2.3 Future tests¶

2.3 Multiple Qserv Installations on a Single Machine¶

3 Spatial Join Performance¶

3.1 Using index¶

3.2 CPU utilization¶

3.3 Near neighbor without building sub-chunking¶

3.4 Near neighbor with predicates¶

3.5 Numbers for lsst10¶

4 Building sub-partitions¶

4.1 Numbers for DC3b¶

4.1.1 Justification¶

4.2 Testing¶

4.2.1 Test 1: CREATE TABLE SELECT FROM WHERE approach¶

4.2.2 Test 2: Segregating in a client program (ascii)¶

4.2.3 Test 3: Segregating in a client (binary protocol)¶

4.3 Note on concurrency¶

4.4 Timing for DC3b¶

5 Sub-partitioning overhead [¶]¶

5.1 Introduction¶

5.2 Conclusions¶

5.3 Test one: determining cost of full table scan¶

5.4 Test two: determining cost of dealing with many small tables¶

5.5 Test three: using subChunkIds¶

5.6 Test four: using “skinny” subChunkIds¶

5.7 Test five: sub-partitioning in a client program¶

5.8 Test six: stored procedure¶

6 MyISAM Compression Performance¶

6.1 Overview¶

6.2 1 minute summary¶

6.3 Goals¶

6.4 Test Conditions¶

6.4.1 Relevance to real data¶

6.4.2 Hardware configuration¶

6.4.3 Test Queries¶

6.4.4 Numbers¶

6.4.4.1 Sizes:¶

6.4.4.2 Query Performance:¶

6.5 Truncated performance¶

6.6 Results discussion¶

6.6.1 Measurement variability¶

6.7 Take-home messages (“conclusions”)¶

7 Storing Reference Catalog[♠]¶

7.1 Building the Reference Catalogs¶

7.2 Description of the Tests¶

7.3 Quick Summary¶

7.3.1 Data ingest¶

7.3.2 Building key¶

7.3.3 Adding primary key¶

7.3.4 Full table scan¶

7.3.5 Selecting through index¶

7.3.5.1 Small # rows¶

7.3.5.2 Large # rows (full index scan)¶

7.3.6 Speed of join¶

7.3.7 Full USNOB-1 Schema¶

8 DC2 Database Partitioning Tests¶

8.1 Code¶

4.2.1 Test 1: `CREATE TABLE SELECT FROM WHERE` approach¶