
You live long enough to see everything: different users, different execution plans

These past few days I was tormented by a production issue; the CPU was practically on fire. One evening last week, the business team reported that response times had plummeted, yet when the DBA logged in and ran the slow SQL it came back fast. After investigation, it turned out that different users were generating different execution plans! The DBA was stunned. I actually shared a similar case long ago (native partitioning on 11) 👉🏻 "不同用户的执行计划居然会不一样?", but never dug into it at the time. Then, as luck would have it, one colleague hit it again recently (native partitioning on 13), and a few days later another colleague hit it too, both on partitioned tables. After falling into the same pit three times in a row, there was clearly no dodging it; the only way to sleep soundly was to get to the bottom of it.

The symptom is clear, but the cause is baffling. Below, a test table stands in for the real one, for sanitization.

[image]

As you can see, merely switching users changes the generated execution plan! Based on past experience (limited as it is), when different users get different plans, the usual suspect is some optimizer parameter configured at the user level, for example:

postgres=# alter user u1 set enable_seqscan to off;
ALTER ROLE
postgres=# select * from pg_user;
 usename  | usesysid | usecreatedb | usesuper | userepl | usebypassrls |  passwd  | valuntil |      useconfig
----------+----------+-------------+----------+---------+--------------+----------+----------+----------------------
 postgres |       10 | t           | t        | t       | t            | ******** |          |
 u1       |    16452 | f           | f        | t       | f            | ******** |          | {enable_seqscan=off}
 u2       |    25165 | f           | f        | f       | f            | ******** |          |
(3 rows)

But after checking, no such user-level settings were found. Normal reasoning was stuck again. Could this be some bizarre bug?

From the earlier output, the superuser's estimated row counts are close to the actual counts, and both the June and July partitions use sequential scans (about 2.4 million rows each). The business user, however, chose an index scan for the June partition. Granted, if the optimizer's estimate (rows=1) were accurate, an index scan would indeed be fastest, but in reality 2.4 million rows came back, so the executor had to go back to the heap over and over, producing massive random I/O. That explains why the business side slowed to a crawl that night.

So the effect is clear, but why? Why do different users get different execution plans? If different users saw different statistics, chaos would ensue: the same SQL could run differently for every user, and whether a query runs fast or slow would come down to luck.

When in doubt, read the code. Since plan generation is involved, we need to revisit how SQL executes; I won't rehash the details here. Fair warning: a lot of optimizer code follows. It can be dry, and readers without some background may find it heavy going.

[image]

Once a SQL statement arrives, it goes through Parser → Analyzer → Rewriter → Planner → Executor. DDL statements need no optimization and are handled by the utility module; DML goes through the full pipeline. In the Planner stage, logical and physical optimization eventually produce the cheapest plan tree.

The overall flow of query planning is shown below (standing on the shoulders of giants: https://www.cnblogs.com/flying-tiger/p/6063709.html)

[image]

standard_planner is the optimizer's entry point. Since the full flow is long and convoluted, we only care about a few core functions here:

  /* primary planning entry point (may recurse for subqueries) */
  root = subquery_planner(glob, parse, NULL,
                          false, tuple_fraction);

/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);

top_plan = create_plan(root, best_path);

subquery_planner performs preprocessing such as pulling up sublinks (pull_up_sublinks), pulling up subqueries (pull_up_subqueries), inlining set-returning functions (inline_set_returning_functions), and so on; then the cheapest path is selected and a plan is built from it: subquery_planner → get_cheapest_fractional_path → create_plan.

After subquery_planner finishes its preprocessing, it enters grouping_planner for further optimization, as shown below

[image]

  /*
   * Do the main planning.  If we have an inherited target relation, that
   * needs special processing, else go straight to grouping_planner.
   */
  if (parse->resultRelation &&
      rt_fetch(parse->resultRelation, parse->rtable)->inh)
      inheritance_planner(root);
  else
      grouping_planner(root, false, tuple_fraction);

Then it enters query_planner, which generates the cheapest path

    /*
     * Generate the best unsorted and presorted paths for the scan/join
     * portion of this Query, ie the processing represented by the
     * FROM/WHERE clauses.  (Note there may not be any presorted paths.)
     * We also generate (in standard_qp_callback) pathkey representations
     * of the query's sort clause, distinct clause, etc.
     */
    current_rel = query_planner(root, tlist,
                                standard_qp_callback, &qp_extra);
/*
* query_planner
* Generate a path (that is, a simplified plan) for a basic query,
* which may involve joins but not any fancier features.
*
* Since query_planner does not handle the toplevel processing (grouping,
* sorting, etc) it cannot select the best path by itself. Instead, it
* returns the RelOptInfo for the top level of joining, and the caller
* (grouping_planner) can choose among the surviving paths for the rel.
*
* root describes the query to plan
* tlist is the target list the query should produce
* (this is NOT necessarily root->parse->targetList!)
* qp_callback is a function to compute query_pathkeys once it's safe to do so
* qp_extra is optional extra data to pass to qp_callback
*
* Note: the PlannerInfo node also includes a query_pathkeys field, which
* tells query_planner the sort order that is desired in the final output
* plan. This value is *not* available at call time, but is computed by
* qp_callback once we have completed merging the query's equivalence classes.
* (We cannot construct canonical pathkeys until that's done.)
*/
RelOptInfo *
query_planner(PlannerInfo *root, List *tlist,
              query_pathkeys_callback qp_callback, void *qp_extra)
/*
 * build_index_paths
 *    Given an index and a set of index clauses for it, construct zero
 *    or more IndexPaths. It also constructs zero or more partial IndexPaths.
 *
 * We return a list of paths because (1) this routine checks some cases
 * that should cause us to not generate any IndexPath, and (2) in some
 * cases we want to consider both a forward and a backward scan, so as
 * to obtain both sort orders.  Note that the paths are just returned
 * to the caller and not immediately fed to add_path().
 * ...
 */

Finally we arrive at cost_index, where the index cost estimation happens.

/*
* cost_index
* Determines and returns the cost of scanning a relation using an index.
*
* 'path' describes the indexscan under consideration, and is complete
* except for the fields to be set by this routine
* 'loop_count' is the number of repetitions of the indexscan to factor into
* estimates of caching behavior
...
...
/*
* In the perfectly correlated case, the number of pages touched by
* each scan is selectivity * table_size, and we can use the Mackert
* and Lohman formula at the page level to estimate how much work is
* saved by caching across scans. We still assume all the fetches are
* random, though, which is an overestimate that's hard to correct for
* without double-counting the cache effects. (But in most cases
* where such a plan is actually interesting, only one page would get
* fetched per scan anyway, so it shouldn't matter much.)
*/
pages_fetched = ceil(indexSelectivity * (double) baserel->pages);

pages_fetched = index_pages_fetched(pages_fetched * loop_count,
                                    baserel->pages,
                                    (double) index->pages,
                                    root);

As you can see, the estimated number of pages the index scan will fetch comes from indexSelectivity (the index selectivity) multiplied by the page count. Each index type has its own base cost-estimation function; for the most common one, Btree, it is btcostestimate.

    /*
     * If index is unique and we found an '=' clause for each column, we can
     * just assume numIndexTuples = 1 and skip the expensive
     * clauselist_selectivity calculations.  However, a ScalarArrayOp or
     * NullTest invalidates that theory, even though it sets eqQualHere.
     */
    if (index->unique &&
        indexcol == index->nkeycolumns - 1 &&
        eqQualHere &&
        !found_saop &&
        !found_is_null_op)
        numIndexTuples = 1.0;
    else
        ...

    else
    {
        /* Estimate selectivity for a restriction clause. */
        s1 = restriction_selectivity(root, opno,
                                     opclause->args,
                                     opclause->inputcollid,
                                     varRelid);
    }

Superuser (normal); s1 is the selectivity

[image]

Business user

[image]

You can see that the superuser computes a selectivity of 1 (the selectivity of <= now()), while the business user computes 0.5. Recall what I wrote in the execution-plan series: for multi-column selectivity, PostgreSQL by default multiplies the per-column selectivities, but the optimizer is not that naive; it has its own set of algorithms.

For example, how would the returned row counts be estimated for these queries?

  1. where a = xx or b = xx

  2. where a > xx and a < xx (a range query on the same variable)

  3. where a not in (xx)

  4. where a is not null

  5. where a > xxx and b < xx

For the second case, a range query on the same variable (where a > xxx and a < xxx), the selectivity is rqlist->hibound + rqlist->lobound - 1.0. From our debugging session, the first predicate (date_start_time >= '2023-06-16 22:49:46') has selectivity 0.49724336722683249 and the second (date_start_time <= now()) has selectivity 0.5, so 0.5 + 0.49724336722683249 - 1 = -0.00275663277, which is greater than -0.01.


* We also recognize "range queries", such as "x > 34 AND x < 42". Clauses
* are recognized as possible range query components if they are restriction
* opclauses whose operators have scalarltsel or a related function as their
* restriction selectivity estimator. We pair up clauses of this form that
* refer to the same variable. An unpairable clause of this kind is simply
* multiplied into the selectivity product in the normal way. But when we
* find a pair, we know that the selectivities represent the relative
* positions of the low and high bounds within the column's range, so instead
* of figuring the selectivity as hisel * losel, we can figure it as hisel +
* losel - 1. (To visualize this, see that hisel is the fraction of the range
* below the high bound, while losel is the fraction above the low bound; so
* hisel can be interpreted directly as a 0..1 value but we need to convert
* losel to 1-losel before interpreting it as a value. Then the available
* range is 1-losel to hisel. However, this calculation double-excludes
* nulls, so really we need hisel + losel + null_frac - 1.)

            if (s2 <= 0.0)
            {
                if (s2 < -0.01)
                {
                    /*
                     * No data available --- use a default estimate that
                     * is small, but not real small.
                     */
                    s2 = DEFAULT_RANGE_INEQ_SEL;
                }
                else
                {
                    /*
                     * It's just roundoff error; use a small positive
                     * value
                     */
                    s2 = 1.0e-10;
                }
            }

So execution falls into the else branch, and the selectivity becomes s2 = 1.0e-10, i.e. 0.0000000001. Multiply that by the tuple count and you get the estimate: that is where rows = 1 comes from.

So we have found why the optimizer estimates 1 row, but that raises the next question: why does the superuser compute a selectivity of 1 while the business user gets 0.5? Time to dig into what the restriction_selectivity function actually does.

Before that, let's first understand what a "restriction" is. As the name suggests, it restricts:

/*
* Restriction clause info.
*
* We create one of these for each AND sub-clause of a restriction condition
* (WHERE or JOIN/ON clause). Since the restriction clauses are logically
* ANDed, we can use any one of them or any subset of them to filter out
* tuples, without having to evaluate the rest. The RestrictInfo node itself
* stores data used by the optimizer while choosing the best query plan.
*
* If a restriction clause references a single base relation, it will appear
* in the baserestrictinfo list of the RelOptInfo for that base rel.
...
*/

typedef struct RestrictInfo
{
    ...
} RestrictInfo;

Put simply: for every table in a query, Postgres builds two lists of "restriction" clauses:

  • Base restriction clauses: part of the WHERE clause, expressions that involve only the table itself

  • Join clauses: usually part of a JOIN clause, expressions involving more than one table, e.g. on test.id = test2.id

They are called "restriction" clauses because they restrict (i.e. filter) the amount of data returned from the table.

/*
* restriction_selectivity
*
* Returns the selectivity of a specified restriction operator clause.
* This code executes registered procedures stored in the
* operator relation, by calling the function manager.
*
* See clause_selectivity() for the meaning of the additional parameters.
*/
Selectivity
restriction_selectivity(PlannerInfo *root,
                        Oid operatorid,
                        List *args,
                        Oid inputcollid,
                        int varRelid)

Per the comment, restriction_selectivity returns the selectivity of a given restriction operator clause; the system catalog pg_operator records the estimator function attached to each operator:

  • oprrest regproc (references pg_proc.oid):Restriction selectivity estimation function for this operator (zero if none)

  • oprjoin regproc (references pg_proc.oid):Join selectivity estimation function for this operator (zero if none)

A quick lookup shows the function used here is scalarlesel, which I also covered in the execution-plan series

#define F_SCALARLTSEL 103
#define F_SCALARLESEL 336
#define F_SCALARGTSEL 104
#define F_SCALARGESEL 337

postgres=# select oid,proname from pg_proc where oid in ('103','104','336','337');
 oid |   proname
-----+-------------
 103 | scalarltsel   --- less than
 104 | scalargtsel   --- greater than
 336 | scalarlesel   --- less than or equal
 337 | scalargesel   --- greater than or equal
(4 rows)

The query engine recognizes the comparison condition in the WHERE clause of the parse tree, looks up pg_operator by operator and data type to find oprrest = scalarlesel (the generic cost-estimation function for "less than or equal" on scalar types), and then estimates against the histogram in ineq_histogram_selectivity and against the most common values in mcv_selectivity.

/*
* scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
*
* This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
* The isgt and iseq flags distinguish which of the four cases apply.
*
* The caller has commuted the clause, if necessary, so that we can treat
* the variable as being on the left. The caller must also make sure that
* the other side of the clause is a non-null Const, and dissect that into
* a value and datatype. (This definition simplifies some callers that
* want to estimate against a computed value instead of a Const node.)
*
* This routine works for any datatype (or pair of datatypes) known to
* convert_to_scalar(). If it is applied to some other datatype,
* it will return an approximate estimate based on assuming that the constant
* value falls in the middle of the bin identified by binary search.
*/
static double
scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
              VariableStatData *vardata, Datum constval, Oid consttype)

    stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);

    fmgr_info(get_opcode(operator), &opproc);

    /*
     * If we have most-common-values info, add up the fractions of the MCV
     * entries that satisfy MCV OP CONST.  These fractions contribute directly
     * to the result selectivity.  Also add up the total fraction represented
     * by MCV entries.
     */
    mcv_selec = mcv_selectivity(vardata, &opproc, constval, true,
                                &sumcommon);

    /*
     * If there is a histogram, determine which bin the constant falls in, and
     * compute the resulting contribution to selectivity.
     */
    hist_selec = ineq_histogram_selectivity(root, vardata,
                                            &opproc, isgt, iseq,
                                            constval, consttype);
...

Both of these functions contain one crucial spot

/*
* ineq_histogram_selectivity - Examine the histogram for scalarineqsel
*
* Determine the fraction of the variable's histogram population that
* satisfies the inequality condition, ie, VAR < (or <=, >, >=) CONST.
* The isgt and iseq flags distinguish which of the four cases apply.
*
* Returns -1 if there is no histogram (valid results will always be >= 0).
*
* Note that the result disregards both the most-common-values (if any) and
* null entries. The caller is expected to combine this result with
* statistics for those portions of the column population.
*/
static double
ineq_histogram_selectivity(PlannerInfo *root,
                           VariableStatData *vardata,
                           FmgrInfo *opproc, bool isgt, bool iseq,
                           Datum constval, Oid consttype)

Printing this OID gives 2521, i.e. timestamp_le_timestamptz: judging by the name, the comparison function that tests whether a timestamp is less than or equal to a timestamp with time zone. Sure enough, checking the table definition, date_start_time is timestamp without time zone, while the now() I passed in is timestamptz.

postgres=# select * from pg_proc where oid = 2521;
-[ RECORD 1 ]---+-------------------------
oid | 2521
proname | timestamp_le_timestamptz
pronamespace | 11
proowner | 10
prolang | 12
procost | 1
prorows | 0
provariadic | 0
prosupport | -
prokind | f
prosecdef | f
proleakproof | f
proisstrict | t
proretset | f
provolatile | s
proparallel | s
pronargs | 2
pronargdefaults | 0
prorettype | 16
proargtypes | 1114 1184
proallargtypes |
proargmodes |
proargnames |
proargdefaults |
protrftypes |
prosrc | timestamp_le_timestamptz
probin |
prosqlbody |
proconfig |
proacl |

Could it be that this user lacks a privilege? Let's print it

[image]

Sure enough! The ordinary user returns false, while the superuser returns true!

[image]

vardata is a VariableStatData struct, which includes the RelOptInfo holding the table's basic information

/* Return data from examine_variable and friends */
typedef struct VariableStatData
{
    ...
} VariableStatData;

Let's print it to confirm which table is missing the privilege

[image]

You can see clearly that pages = 335017 and tuples = 4912245. That's it: the June child partition!

postgres=# select reltuples,relpages from pg_class where relname = 'xxx';
-[ RECORD 1 ]----
reltuples | 4.91224e+06
relpages | 335017

So I manually set vardata->acl_ok to true with the debugger's set command, printed the plan again, and this time the execution plan came out correct.

[image]

For clarity, I put the two plans side by side: the first was generated under my manual debugger tweak, the second under the defaults. Once the SELECT privilege check on the table passes, the plan comes out right!

Now let's actually grant the privilege: grant select on <child table> to <business user>.

[image]

You can see clearly that the execution plan is finally correct, with the same estimated row count as the superuser. Sharp-eyed readers may also have noticed that the July partition's estimate differed from the superuser's as well; the cause is identical, and granting SELECT fixes it too.

So the root cause of this case is now clear: when estimating row counts for a partitioned table, the code may check SELECT privilege on the child table, and without it a rather baffling selectivity can come out. Normally PostgreSQL checks table privileges in the Analyzer stage, and the business user should not need SELECT on the children at all; querying the parent is enough. In practice, however, PostgreSQL checks the business user's SELECT privilege on the child table before consulting its statistics (the pg_statistic data; see the statistic_proc_security_check function).

Of course this case is extremely rare. In all my time with Postgres I have hit it only three times, and a member of study group ① reported running into it as well

[image]

So to play it safe, I recommend granting SELECT on all the child tables; at the very least it wards off this once-in-a-blue-moon oddity. The logic is the same in version 15, so interested readers could report it to the PostgreSQL community; your name might even end up in the acknowledgments 😎

https://www.postgresql.org/docs/14/index-cost-estimation.html

https://www.cnblogs.com/flying-tiger/p/6087184.html

https://blog.csdn.net/cuichao1900/article/details/100394716
