
You live long enough to see everything: different users, different execution plans

These past few days I was tormented by a production issue; the CPU was practically on fire. One evening last week, the business team reported that response times had plummeted, yet when the DBA logged in and ran the slow SQL it came back fast. After investigation, it turned out that different users were generating different execution plans! The DBA was stunned. I actually shared a similar case long ago (native partitioning on 11) 👉🏻 "不同用户的执行计划居然会不一样?", but never dug into it at the time. Then, as luck would have it, one colleague hit it again recently (native partitioning on 13), and a few days later another colleague hit it too, both on partitioned tables. After falling into the same pit three times in a row, there was clearly no dodging it; the only way to sleep soundly was to get to the bottom of it.

The symptom is clear, but the cause is baffling. Below, a test table stands in for the real one, for sanitization.

[image]

As you can see, merely switching users changes the generated execution plan! Based on past experience (limited as it is), when different users get different plans, the usual suspect is some optimizer parameter configured at the user level, for example:

postgres=# alter user u1 set enable_seqscan to off;
ALTER ROLE
postgres=# select * from pg_user;
 usename  | usesysid | usecreatedb | usesuper | userepl | usebypassrls |  passwd  | valuntil |      useconfig
----------+----------+-------------+----------+---------+--------------+----------+----------+----------------------
 postgres |       10 | t           | t        | t       | t            | ******** |          |
 u1       |    16452 | f           | f        | t       | f            | ******** |          | {enable_seqscan=off}
 u2       |    25165 | f           | f        | f       | f            | ******** |          |
(3 rows)

But after checking, no such user-level settings were found. Normal reasoning was stuck again. Could this be some bizarre bug?

From the earlier output, the superuser's estimated row counts are close to the actual counts, and both the June and July partitions use sequential scans (about 2.4 million rows each). The business user, however, chose an index scan for the June partition. Granted, if the optimizer's estimate (rows=1) were accurate, an index scan would indeed be fastest, but in reality 2.4 million rows came back, so the executor had to go back to the heap over and over, producing massive random I/O. That explains why the business side slowed to a crawl that night.

So the effect is clear, but why? Why do different users get different execution plans? If different users saw different statistics, chaos would ensue: the same SQL could run differently for every user, and whether a query runs fast or slow would come down to luck.

When in doubt, read the code. Since plan generation is involved, we need to revisit how SQL executes; I won't rehash the details here. Fair warning: a lot of optimizer code follows. It can be dry, and readers without some background may find it heavy going.

[image]

Once a SQL statement arrives, it goes through Parser → Analyzer → Rewriter → Planner → Executor. DDL statements need no optimization and are handled by the utility module; DML goes through the full pipeline. In the Planner stage, logical and physical optimization eventually produce the cheapest plan tree.

The overall flow of query planning is shown below (standing on the shoulders of giants: https://www.cnblogs.com/flying-tiger/p/6063709.html)

[image]

standard_planner is the optimizer's entry point. Since the full flow is long and convoluted, we only care about a few core functions here:

  /* primary planning entry point (may recurse for subqueries) */
  root = subquery_planner(glob, parse, NULL,
                          false, tuple_fraction);

/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);

top_plan = create_plan(root, best_path);

subquery_planner performs preprocessing such as pulling up sublinks (pull_up_sublinks), pulling up subqueries (pull_up_subqueries), inlining set-returning functions (inline_set_returning_functions), and so on; then the cheapest path is selected and a plan is built from it: subquery_planner → get_cheapest_fractional_path → create_plan.

After subquery_planner finishes its preprocessing, it enters grouping_planner for further optimization, as shown below

[image]

  /*
   * Do the main planning.  If we have an inherited target relation, that
   * needs special processing, else go straight to grouping_planner.
   */
  if (parse->resultRelation &&
      rt_fetch(parse->resultRelation, parse->rtable)->inh)
      inheritance_planner(root);
  else
      grouping_planner(root, false, tuple_fraction);

Then it enters query_planner, which generates the cheapest path

    /*
     * Generate the best unsorted and presorted paths for the scan/join
     * portion of this Query, ie the processing represented by the
     * FROM/WHERE clauses.  (Note there may not be any presorted paths.)
     * We also generate (in standard_qp_callback) pathkey representations
     * of the query's sort clause, distinct clause, etc.
     */
    current_rel = query_planner(root, tlist,
                                standard_qp_callback, &qp_extra);
/*
* query_planner
* Generate a path (that is, a simplified plan) for a basic query,
* which may involve joins but not any fancier features.
*
* Since query_planner does not handle the toplevel processing (grouping,
* sorting, etc) it cannot select the best path by itself. Instead, it
* returns the RelOptInfo for the top level of joining, and the caller
* (grouping_planner) can choose among the surviving paths for the rel.
*
* root describes the query to plan
* tlist is the target list the query should produce
* (this is NOT necessarily root->parse->targetList!)
* qp_callback is a function to compute query_pathkeys once it's safe to do so
* qp_extra is optional extra data to pass to qp_callback
*
* Note: the PlannerInfo node also includes a query_pathkeys field, which
* tells query_planner the sort order that is desired in the final output
* plan. This value is *not* available at call time, but is computed by
* qp_callback once we have completed merging the query's equivalence classes.
* (We cannot construct canonical pathkeys until that's done.)
*/
RelOptInfo *
query_planner(PlannerInfo *root, List *tlist,
              query_pathkeys_callback qp_callback, void *qp_extra)
/*
 * build_index_paths
 *    Given an index and a set of index clauses for it, construct zero
 *    or more IndexPaths. It also constructs zero or more partial IndexPaths.
 *
 * We return a list of paths because (1) this routine checks some cases
 * that should cause us to not generate any IndexPath, and (2) in some
 * cases we want to consider both a forward and a backward scan, so as
 * to obtain both sort orders.  Note that the paths are just returned
 * to the caller and not immediately fed to add_path().
 * ...
 */

Finally we arrive at cost_index, where the index cost estimation happens.

/*
* cost_index
* Determines and returns the cost of scanning a relation using an index.
*
* 'path' describes the indexscan under consideration, and is complete
* except for the fields to be set by this routine
* 'loop_count' is the number of repetitions of the indexscan to factor into
* estimates of caching behavior
...
...
/*
* In the perfectly correlated case, the number of pages touched by
* each scan is selectivity * table_size, and we can use the Mackert
* and Lohman formula at the page level to estimate how much work is
* saved by caching across scans. We still assume all the fetches are
* random, though, which is an overestimate that's hard to correct for
* without double-counting the cache effects. (But in most cases
* where such a plan is actually interesting, only one page would get
* fetched per scan anyway, so it shouldn't matter much.)
*/
pages_fetched = ceil(indexSelectivity * (double) baserel->pages);

pages_fetched = index_pages_fetched(pages_fetched * loop_count,
                                    baserel->pages,
                                    (double) index->pages,
                                    root);

As you can see, the estimated number of pages the index scan will fetch comes from indexSelectivity (the index selectivity) multiplied by the page count. Each index type has its own base cost-estimation function; for the most common one, Btree, it is btcostestimate.

    /*
     * If index is unique and we found an '=' clause for each column, we can
     * just assume numIndexTuples = 1 and skip the expensive
     * clauselist_selectivity calculations.  However, a ScalarArrayOp or
     * NullTest invalidates that theory, even though it sets eqQualHere.
     */
    if (index->unique &&
        indexcol == index->nkeycolumns - 1 &&
        eqQualHere &&
        !found_saop &&
        !found_is_null_op)
        numIndexTuples = 1.0;
    else
        ...

    else
    {
        /* Estimate selectivity for a restriction clause. */
        s1 = restriction_selectivity(root, opno,
                                     opclause->args,
                                     opclause->inputcollid,
                                     varRelid);
    }

Superuser (normal); s1 is the selectivity

[image]

Business user

[image]

You can see that the superuser computes a selectivity of 1 (the selectivity of <= now()), while the business user computes 0.5. Recall what I wrote in the execution-plan series: for multi-column selectivity, PostgreSQL by default multiplies the per-column selectivities, but the optimizer is not that naive; it has its own set of algorithms.

For example, how would the returned row counts be estimated for these queries?

  1. where a = xx or b = xx

  2. where a > xx and a < xx (a range query on the same variable)

  3. where a not in (xx)

  4. where a is not null

  5. where a > xxx and b < xx

For the second case, a range query on the same variable (where a > xxx and a < xxx), the selectivity is rqlist->hibound + rqlist->lobound - 1.0. From our debugging session, the first predicate (date_start_time >= '2023-06-16 22:49:46') has selectivity 0.49724336722683249 and the second (date_start_time <= now()) has selectivity 0.5, so 0.5 + 0.49724336722683249 - 1 = -0.00275663277, which is greater than -0.01.


* We also recognize "range queries", such as "x > 34 AND x < 42". Clauses
* are recognized as possible range query components if they are restriction
* opclauses whose operators have scalarltsel or a related function as their
* restriction selectivity estimator. We pair up clauses of this form that
* refer to the same variable. An unpairable clause of this kind is simply
* multiplied into the selectivity product in the normal way. But when we
* find a pair, we know that the selectivities represent the relative
* positions of the low and high bounds within the column's range, so instead
* of figuring the selectivity as hisel * losel, we can figure it as hisel +
* losel - 1. (To visualize this, see that hisel is the fraction of the range
* below the high bound, while losel is the fraction above the low bound; so
* hisel can be interpreted directly as a 0..1 value but we need to convert
* losel to 1-losel before interpreting it as a value. Then the available
* range is 1-losel to hisel. However, this calculation double-excludes
* nulls, so really we need hisel + losel + null_frac - 1.)

            if (s2 <= 0.0)
            {
                if (s2 < -0.01)
                {
                    /*
                     * No data available --- use a default estimate that
                     * is small, but not real small.
                     */
                    s2 = DEFAULT_RANGE_INEQ_SEL;
                }
                else
                {
                    /*
                     * It's just roundoff error; use a small positive
                     * value
                     */
                    s2 = 1.0e-10;
                }
            }

So execution falls into the else branch, and the selectivity becomes s2 = 1.0e-10, i.e. 0.0000000001. Multiply that by the tuple count and you get the estimate: that is where rows = 1 comes from.

So we have found why the optimizer estimates 1 row, but that raises the next question: why does the superuser compute a selectivity of 1 while the business user gets 0.5? Time to dig into what the restriction_selectivity function actually does.

Before that, let's first understand what a "restriction" is. As the name suggests, it restricts:

/*
* Restriction clause info.
*
* We create one of these for each AND sub-clause of a restriction condition
* (WHERE or JOIN/ON clause). Since the restriction clauses are logically
* ANDed, we can use any one of them or any subset of them to filter out
* tuples, without having to evaluate the rest. The RestrictInfo node itself
* stores data used by the optimizer while choosing the best query plan.
*
* If a restriction clause references a single base relation, it will appear
* in the baserestrictinfo list of the RelOptInfo for that base rel.
...
*/

typedef struct RestrictInfo
{
    ...
} RestrictInfo;

Put simply: for every table in a query, Postgres builds two lists of "restriction" clauses:

  • Base restriction clauses: part of the WHERE clause, expressions that involve only the table itself

  • Join clauses: usually part of a JOIN clause, expressions involving more than one table, e.g. on test.id = test2.id

They are called "restriction" clauses because they restrict (i.e. filter) the amount of data returned from the table.

/*
* restriction_selectivity
*
* Returns the selectivity of a specified restriction operator clause.
* This code executes registered procedures stored in the
* operator relation, by calling the function manager.
*
* See clause_selectivity() for the meaning of the additional parameters.
*/
Selectivity
restriction_selectivity(PlannerInfo *root,
                        Oid operatorid,
                        List *args,
                        Oid inputcollid,
                        int varRelid)

Per the comment, restriction_selectivity returns the selectivity of a given restriction operator clause; the system catalog pg_operator records the estimator function attached to each operator:

  • oprrest regproc (references pg_proc.oid):Restriction selectivity estimation function for this operator (zero if none)

  • oprjoin regproc (references pg_proc.oid):Join selectivity estimation function for this operator (zero if none)

A quick lookup shows the function used here is scalarlesel, which I also covered in the execution-plan series

#define F_SCALARLTSEL 103
#define F_SCALARLESEL 336
#define F_SCALARGTSEL 104
#define F_SCALARGESEL 337

postgres=# select oid,proname from pg_proc where oid in ('103','104','336','337');
 oid |   proname
-----+-------------
 103 | scalarltsel   --- less than
 104 | scalargtsel   --- greater than
 336 | scalarlesel   --- less than or equal
 337 | scalargesel   --- greater than or equal
(4 rows)

The query engine recognizes the comparison condition in the WHERE clause of the parse tree, looks up pg_operator by operator and data type to find oprrest = scalarlesel (the generic cost-estimation function for "less than or equal" on scalar types), and then estimates against the histogram in ineq_histogram_selectivity and against the most common values in mcv_selectivity.

/*
* scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
*
* This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
* The isgt and iseq flags distinguish which of the four cases apply.
*
* The caller has commuted the clause, if necessary, so that we can treat
* the variable as being on the left. The caller must also make sure that
* the other side of the clause is a non-null Const, and dissect that into
* a value and datatype. (This definition simplifies some callers that
* want to estimate against a computed value instead of a Const node.)
*
* This routine works for any datatype (or pair of datatypes) known to
* convert_to_scalar(). If it is applied to some other datatype,
* it will return an approximate estimate based on assuming that the constant
* value falls in the middle of the bin identified by binary search.
*/
static double
scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
              VariableStatData *vardata, Datum constval, Oid consttype)

    stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);

    fmgr_info(get_opcode(operator), &opproc);

    /*
     * If we have most-common-values info, add up the fractions of the MCV
     * entries that satisfy MCV OP CONST.  These fractions contribute directly
     * to the result selectivity.  Also add up the total fraction represented
     * by MCV entries.
     */
    mcv_selec = mcv_selectivity(vardata, &opproc, constval, true,
                                &sumcommon);

    /*
     * If there is a histogram, determine which bin the constant falls in, and
     * compute the resulting contribution to selectivity.
     */
    hist_selec = ineq_histogram_selectivity(root, vardata,
                                            &opproc, isgt, iseq,
                                            constval, consttype);
...

Both of these functions contain one crucial spot

/*
* ineq_histogram_selectivity - Examine the histogram for scalarineqsel
*
* Determine the fraction of the variable's histogram population that
* satisfies the inequality condition, ie, VAR < (or <=, >, >=) CONST.
* The isgt and iseq flags distinguish which of the four cases apply.
*
* Returns -1 if there is no histogram (valid results will always be >= 0).
*
* Note that the result disregards both the most-common-values (if any) and
* null entries. The caller is expected to combine this result with
* statistics for those portions of the column population.
*/
static double
ineq_histogram_selectivity(PlannerInfo *root,
                           VariableStatData *vardata,
                           FmgrInfo *opproc, bool isgt, bool iseq,
                           Datum constval, Oid consttype)

Printing this OID gives 2521, i.e. timestamp_le_timestamptz: judging by the name, the comparison function that tests whether a timestamp is less than or equal to a timestamp with time zone. Sure enough, checking the table definition, date_start_time is timestamp without time zone, while the now() I passed in is timestamptz.

postgres=# select * from pg_proc where oid = 2521;
-[ RECORD 1 ]---+-------------------------
oid | 2521
proname | timestamp_le_timestamptz
pronamespace | 11
proowner | 10
prolang | 12
procost | 1
prorows | 0
provariadic | 0
prosupport | -
prokind | f
prosecdef | f
proleakproof | f
proisstrict | t
proretset | f
provolatile | s
proparallel | s
pronargs | 2
pronargdefaults | 0
prorettype | 16
proargtypes | 1114 1184
proallargtypes |
proargmodes |
proargnames |
proargdefaults |
protrftypes |
prosrc | timestamp_le_timestamptz
probin |
prosqlbody |
proconfig |
proacl |

Could it be that this user lacks a privilege? Let's print it

[image]

Sure enough! The ordinary user returns false, while the superuser returns true!

[image]

vardata is a VariableStatData struct, which includes the RelOptInfo holding the table's basic information

/* Return data from examine_variable and friends */
typedef struct VariableStatData
{
    ...
} VariableStatData;

Let's print it to confirm which table is missing the privilege

[image]

You can see clearly that pages = 335017 and tuples = 4912245. That's it: the June child partition!

postgres=# select reltuples,relpages from pg_class where relname = 'xxx';
-[ RECORD 1 ]----
reltuples | 4.91224e+06
relpages | 335017

So I manually set vardata->acl_ok to true with the debugger's set command, printed the plan again, and this time the execution plan came out correct.

[image]

For clarity, I put the two plans side by side: the first was generated under my manual debugger tweak, the second under the defaults. Once the SELECT privilege check on the table passes, the plan comes out right!

Now let's actually grant the privilege: grant select on <child table> to <business user>.

[image]

You can see clearly that the execution plan is finally correct, with the same estimated row count as the superuser. Sharp-eyed readers may also have noticed that the July partition's estimate differed from the superuser's as well; the cause is identical, and granting SELECT fixes it too.

So the root cause of this case is now clear: when estimating row counts for a partitioned table, the code may check SELECT privilege on the child table, and without it a rather baffling selectivity can come out. Normally PostgreSQL checks table privileges in the Analyzer stage, and the business user should not need SELECT on the children at all; querying the parent is enough. In practice, however, PostgreSQL checks the business user's SELECT privilege on the child table before consulting its statistics (the pg_statistic data; see the statistic_proc_security_check function).

Of course this case is extremely rare. In all my time with Postgres I have hit it only three times, and a member of study group ① reported running into it as well

[image]

So to play it safe, I recommend granting SELECT on all the child tables; at the very least it wards off this once-in-a-blue-moon oddity. The logic is the same in version 15, so interested readers could report it to the PostgreSQL community; your name might even end up in the acknowledgments 😎

https://www.postgresql.org/docs/14/index-cost-estimation.html

https://www.cnblogs.com/flying-tiger/p/6087184.html

https://blog.csdn.net/cuichao1900/article/details/100394716
