01 — Bai & Ng (2002)

Determining the Number of Factors in Approximate Factor Models

Econometrica 70(1), 191–221 · Kei Matsumae · 2026-05-15

What this paper does

The first consistent, easy-to-compute estimator of $K$ (number of latent factors) in an approximate factor model when both $N$ and $T$ are large.
Fit PCA with $k$ candidate factors → residual variance $V(k)$ → penalise model complexity with a penalty that vanishes slowly enough to identify the true $K$.
Six concrete criteria: $\text{PC}_{p1}, \text{PC}_{p2}, \text{PC}_{p3}$ and $\text{IC}_{p1}, \text{IC}_{p2}, \text{IC}_{p3}$.
Foundation for Bai (2003) inferential theory, Onatski (2010) eigenvalue-ratio, CRW (2023) regressed-PCA factor count.
For AOF: the default $K$-selector for the US sample (large $N$, large $T$). Switch to Onatski/CRW for rolling-window or JP small-$T$ estimation.

1. Why this paper exists

In a CAPM or Fama-French world, "how many factors?" is decided by theory. In an approximate factor model with latent factors (Chamberlain & Rothschild 1983) the analyst chooses $K$ from data. Pre-2002 this was either ad-hoc (scree plot, eigenvalue threshold) or formally inconsistent (standard AIC/BIC don't work when both panel dimensions grow).

Bai & Ng's contribution is a family of penalised criteria consistently selecting the true $K$ as $N, T \to \infty$ jointly.

1.5. Background — what's an "approximate factor model"?

The paper assumes you know this term. Worth pinning down. Asset-pricing models come in three flavours of increasing realism:

(1) One-factor / CAPM-style

$$ r_{it} = \alpha_i + \beta_i\, R_{mt} + \varepsilon_{it} $$

One observed factor (market return), constant loading $\beta_i$ per stock, constant intercept $\alpha_i$. Classical assumption: $\varepsilon_{it}$ is i.i.d. across stocks and over time — noise covariance $\text{Cov}(\varepsilon)$ is a diagonal matrix, every stock's idiosyncratic shock independent of every other's.

(2) Multi-factor (Fama-French and descendants)

$$ r_{it} = \alpha_i + \beta_i'\, F_t + \varepsilon_{it} $$

$K$ observed factors (market, size, value, momentum, profitability, …). Still assumes $\varepsilon_{it}$ are cross-sectionally uncorrelated — factors are supposed to capture all the comovement.

Clean on the chalkboard. Doesn't survive real data: even after controlling for FF5, residuals across stocks remain correlated — sector-specific news, supply-chain shocks, fund-flow effects all leak through.

(3) Approximate factor model (Chamberlain & Rothschild 1983) — what Bai-Ng uses

Same equation form, but the noise assumption is relaxed:

$$ X_{it} = \lambda_i'\, F_t + e_{it} $$

$e_{it}$ is allowed to have weak cross-sectional and serial correlation — sector spillovers OK, mild persistence OK.
Formal condition: the largest eigenvalue of the $N \times N$ noise covariance matrix $\text{Cov}(e_t)$ stays bounded as $N$ grows.
Meanwhile, the factor part $\lambda_i' F_t$ has variance that grows with $N$ — the top-$K$ eigenvalues of the factor-driven covariance scale linearly in $N$.

The word "approximate" refers to this relaxation. Ross's "exact" APT (1976) demanded uncorrelated $e_{it}$; Chamberlain-Rothschild's "approximate" version allows weak correlation but draws the line at the eigenvalue-scaling gap above.

Why finance needs the relaxation. In equity panels, residuals are always weakly correlated; that's just reality. A strict factor model has no room for that correlation: any residual comovement would have to be explained by adding more factors, so $K$ stops being well defined. The approximate factor model instead lets weak correlation stay in $e_{it}$ and identifies the factors by the eigenvalue-scaling gap above.

Vocabulary that follows

Term	Meaning
Strong factors	Top-$K$ eigenvalues scale linearly in $N$. Signal grows with universe; PCA picks them out cleanly.
Weak factors	Factor eigenvalues stay bounded or grow more slowly than $N$. S/N stays bounded; standard PCA can over-select. (Onatski 2010 handles this regime.)
Common component	$\lambda_i' F_t$ — the part factors explain.
Idiosyncratic component	$e_{it}$ — the rest. Weak correlation allowed, independence not assumed.
Large-$N$, large-$T$ asymptotics	Both panel dimensions grow. Large $N$ identifies factors via cross-section. Large $T$ identifies loadings via time-series.

What Bai-Ng adds

Given this setup, Bai-Ng (2002) asks: how do we choose $K$ from data? Pre-2002, methods were either ad-hoc (scree plot — visual eyeballing of where eigenvalues "elbow") or used standard AIC/BIC criteria that don't have the right asymptotic behaviour when both $N$ and $T$ grow. Bai-Ng give the first consistent criteria for this exact regime.

1.6. Unpacking the basics — what those terms actually mean

§1.5 still relies on terms (covariance matrix, eigenvalue, AIC/BIC, "consistent") that need their own ground-up explanation. Here is each, with finance-concrete examples.

(a) "Noise covariance $\text{Cov}(\varepsilon)$ is a diagonal matrix" — what does that actually say?

A covariance matrix for $N$ stocks' noise terms is an $N \times N$ grid:

The diagonal entry $(i, i)$ = stock $i$'s own variance — size of its idiosyncratic noise.
The off-diagonal entry $(i, j)$ = covariance between stock $i$'s noise and stock $j$'s noise — how much they move together after factors are stripped out.

A diagonal matrix has all off-diagonals = 0 — no two stocks' noises are correlated. Strong claim.

Finance intuition. Are Toyota's and Sony's company-specific shocks truly independent? In real data, no: JPY-USD moves, BoJ policy and the global business cycle leak through as "factor-residual" effects that hit both. The diagonal assumption breaks. Chamberlain-Rothschild relax it: residual correlation is fine as long as it stays weak, i.e. it never builds into a factor-strength eigenvalue that grows with $N$.

(b) "Largest eigenvalue bounded vs. linearly scaling in $N$" — what's an eigenvalue and why does the scaling matter?

An eigenvalue of a covariance matrix is the variance along one principal axis of variation. The largest = "how big is the dominant direction of movement".

Eigenvalues of $\text{Cov}(\lambda_i' F_t)$ → magnitudes of the $K$ axes the factors drive.
Eigenvalues of $\text{Cov}(e_t)$ → magnitudes of the residual directions.

Why one grows and the other doesn't.

Add a stock. It loads on the same factors as everyone else. One more vector in the same direction → variance in the factor direction grows by ~1. After $N$ stocks, factor eigenvalue ≈ $N$. Linear in $N$.
Add a stock. Its noise is ~independent. Noise doesn't pile in any one direction. Each noise eigenvalue stays at one-stock scale. Bounded.

Consequence. The gap between factor and noise eigenvalues grows with $N$. With 10 stocks, can't tell. With 5,000 stocks, the top factor eigenvalue is hundreds of times larger than the largest noise eigenvalue. This gap is what lets PCA and eigenvalue-ratio selectors identify $K$.
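The scaling argument can be checked numerically. The sketch below is illustrative only: the loading distribution, the number of factors, and the $\rho^{|i-j|}$ noise covariance are assumptions chosen for the demo, not taken from the paper. It works with population covariance matrices so that sampling noise does not blur the picture.

```python
import numpy as np

def top_eigs(N, K=2, rho=0.3, seed=0):
    """Largest eigenvalue of the factor-driven and of the noise covariance (population versions)."""
    rng = np.random.default_rng(seed)
    # Assumed loadings: every stock loads on all K factors, entries ~ N(1, 0.5^2).
    Lam = rng.normal(1.0, 0.5, size=(N, K))
    # Factors assumed uncorrelated with unit variance, so the common covariance is Lam Lam'.
    common = Lam @ Lam.T
    # Weakly correlated noise: Cov(e_i, e_j) = rho^|i-j|; its top eigenvalue
    # stays below (1 + rho) / (1 - rho) for every N.
    idx = np.arange(N)
    noise = rho ** np.abs(idx[:, None] - idx[None, :])
    # eigvalsh returns eigenvalues in ascending order, so [-1] is the largest.
    return np.linalg.eigvalsh(common)[-1], np.linalg.eigvalsh(noise)[-1]

for N in (10, 100, 1000):
    f, e = top_eigs(N)
    print(f"N={N:5d}  factor eig={f:9.1f}  noise eig={e:5.2f}  ratio={f / e:7.1f}")
```

Under these assumptions the factor eigenvalue grows roughly in proportion to $N$, while the noise eigenvalue stays below its fixed bound even though the noise is not diagonal.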

(c) Vocabulary fleshed out with examples

Strong factor. Market beta. Every stock loads positively. $N$ grows → loadings stack → eigenvalue scales linearly.

Weak factor. 2024 AI-CapEx niche theme. Meaningful for ~50 stocks, ~zero for the other ~4,950. Even at large $N$, eigenvalue stays small. Standard PCA can confuse it with noise.

Common component $\lambda_i' F_t$. Of Toyota's 3% return: market +0.5%, size +0.3%, value −0.1% → 0.7% common.

Idiosyncratic component $e_{it}$. Remaining 2.3% — Toyota-specific news. Not assumed independent of Honda/Nissan; same-industry residual correlation is allowed.

Large-$N$, large-$T$ asymptotics. "Both panel dimensions big enough that asymptotic theorems apply". CRSP US: $N \approx 5{,}000$, $T \approx 700$ — both large. J-Quants JP: $N \approx 4{,}000$, $T \approx 200$ — both large. "S&P 500 over 5 years" ($N = 500$, $T = 60$) is too small for asymptotics.

(d) Why Bai-Ng was needed — what scree plot and AIC/BIC don't do

Scree plot. Chart of eigenvalues in descending order. Eyeball the "elbow" → $\hat K$.

eigenvalue
 ↑
 |  *
 |   *
 |    *  ← elbow at K=3
 |     ·  ·  ·  ·  ·  ·
 |
 +─────────→ rank

Problem: subjective. Two analysts can disagree on $K=3$ vs $K=5$. No reproducibility, no formal test.

AIC / BIC. Standard time-series model-selection: error + penalty × parameter-count.

Problem: derived for "$N$ fixed, $T \to \infty$" or fully-parametric. Factor models grow both $N$ and $T$ and tolerate weakly-correlated residuals. Standard penalty rates don't match the asymptotic behaviour of $V(k)$ here. Result: standard AIC/BIC over- or under-selects — not consistent.

"Consistent" criterion. Statistical term: as $N, T \to \infty$, $\Pr(\hat K = K) \to 1$. Bai-Ng (2002) give the first criteria provably satisfying this in the approximate-factor regime. Their innovation: a new rate condition (the penalty must go to zero, but more slowly than $\min(N,T)^{-1}$) and the $\text{PC}_p$ / $\text{IC}_p$ families designed to satisfy it. The six formulas in §3 make this concrete.

2. The model

$$ X_{it} \;=\; \lambda_i' F_t \;+\; e_{it}, \qquad i=1,\dots,N,\; t=1,\dots,T. $$

Symbol	Meaning
$X_{it}$	observed data (e.g., a macro series, an asset return)
$F_t$	$K \times 1$ vector of common factors
$\lambda_i$	$K \times 1$ vector of factor loadings
$e_{it}$	idiosyncratic component (weakly cross-sectionally / serially correlated allowed)

This is an approximate factor model — $e_{it}$ does not need to be i.i.d., just well-behaved. The true $K$ is unknown.

3. The key trick — penalised criteria

Fix a maximum candidate $k_{\max}$. For each $k = 0, 1, \dots, k_{\max}$:

Run PCA on $X$ keeping $k$ factors → estimates $\hat F^k_t$, $\hat\lambda_i^k$.
Compute the residual sum of squares per observation: $$ V(k, \hat F^k) \;=\; \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T \bigl(X_{it} - \hat\lambda_i^{k\prime} \hat F^k_t\bigr)^2. $$
Apply a penalty $g(N, T) \cdot k$ that pushes back against overfitting.

$\hat K$ minimises $V(k) + k\, g(N,T)$ (PC family) or $\ln V(k) + k\, g(N,T)$ (IC family).
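The three steps do not need a separate PCA per $k$. A minimal sketch, assuming the panel is stored as a $T \times N$ array that has already been demeaned or standardised as desired: the rank-$k$ PCA fit is the truncated SVD, so $V(k)$ is the sum of squared trailing singular values divided by $NT$. The function name `residual_variance` is hypothetical.

```python
import numpy as np

def residual_variance(X, k_max):
    """V(k) for k = 0..k_max from a single SVD of the T x N panel X."""
    T, N = X.shape
    s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
    total = np.sum(X ** 2)                        # equals the sum of all s**2
    explained = np.concatenate(([0.0], np.cumsum(s[:k_max] ** 2)))
    return (total - explained) / (N * T)

# Sanity check against an explicit rank-k reconstruction on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 30))
V = residual_variance(X, k_max=5)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_hat = (U[:, :k] * s[:k]) @ Vt[:k]               # rank-3 fitted common component
V_direct = np.mean((X - X_hat) ** 2)              # same 1/(NT) normalisation
print(np.isclose(V[k], V_direct))
```

The check compares the shortcut with the residual variance computed directly from the fitted values; the two agree up to floating-point error.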

The six criteria

$$\begin{aligned} \text{PC}_{p1}(k) &= V(k) + k\, \hat\sigma^2 \cdot \tfrac{N+T}{NT} \ln\!\left(\tfrac{NT}{N+T}\right),\\ \text{PC}_{p2}(k) &= V(k) + k\, \hat\sigma^2 \cdot \tfrac{N+T}{NT} \ln C_{NT}^2,\\ \text{PC}_{p3}(k) &= V(k) + k\, \hat\sigma^2 \cdot \tfrac{\ln C_{NT}^2}{C_{NT}^2},\\ \text{IC}_{p1}(k) &= \ln V(k) + k \cdot \tfrac{N+T}{NT} \ln\!\left(\tfrac{NT}{N+T}\right),\\ \text{IC}_{p2}(k) &= \ln V(k) + k \cdot \tfrac{N+T}{NT} \ln C_{NT}^2,\\ \text{IC}_{p3}(k) &= \ln V(k) + k \cdot \tfrac{\ln C_{NT}^2}{C_{NT}^2},\\ \end{aligned}$$

where $C_{NT}^2 = \min(N, T)$ and $\hat\sigma^2$ is the average idiosyncratic variance (typically $V(k_{\max})$).

The penalty must satisfy two rate conditions: (i) $g(N,T) \to 0$ as $N, T \to \infty$; otherwise the penalty can outweigh the fit gain from a true factor and the criterion under-selects; and (ii) $C_{NT}^2\, g(N,T) \to \infty$, i.e. the penalty shrinks more slowly than $\min(N,T)^{-1}$; otherwise fitted noise lowers $V(k)$ as fast as the penalty rises and the criterion drifts towards $k = k_{\max}$. The six variants are all consistent; they differ in finite-sample behaviour.
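The two rate conditions can be checked numerically for the three penalty terms. The sketch below evaluates each $g(N,T)$ and the product $\min(N,T)\cdot g(N,T)$ on square panels of growing size. The `aic_like` entry ($2/T$) is not one of the six criteria; it is included only as an assumed contrast, to show a penalty that goes to zero but fails condition (ii).

```python
import numpy as np

def penalties(N, T):
    """Penalty terms g(N, T) of the p1/p2/p3 variants, plus an AIC-style 2/T for contrast."""
    c2 = min(N, T)                        # C_NT^2
    r = (N + T) / (N * T)
    return {
        "g1": r * np.log(1.0 / r),        # (N+T)/(NT) * ln(NT/(N+T))
        "g2": r * np.log(c2),             # (N+T)/(NT) * ln(C_NT^2)
        "g3": np.log(c2) / c2,            # ln(C_NT^2) / C_NT^2
        "aic_like": 2.0 / T,              # comparison only, not a Bai-Ng criterion
    }

for n in (50, 500, 5000, 50000):
    g = penalties(n, n)
    # For each penalty: (value of g, value of min(N,T) * g)
    row = {name: (round(float(v), 5), round(n * float(v), 2)) for name, v in g.items()}
    print(n, row)
```

For the three Bai-Ng terms the first number falls towards zero while the second keeps growing, which is exactly conditions (i) and (ii). For the $2/T$ term the second number is stuck at 2, so the penalty cannot stay ahead of the $\min(N,T)^{-1}$ noise term.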

4. Consistency

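Theorem 2 (Bai & Ng). As $N, T \to \infty$ jointly, $\Pr(\hat K = K) \to 1$ for the PC criteria; the paper's Corollary 1 extends the result to the IC criteria. Both hold under moment conditions on factors and loadings and weak-dependence limits on $e_{it}$, provided $k_{\max} \geq K$.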

Sketch: for $k < K$, $V(k)$ is bounded away from $V(K)$ — under-fitting penalised. For $k > K$, $V(k) - V(K) \to 0$ at rate $\min(N,T)^{-1}$, while the penalty shrinks more slowly — over-fitting penalised.
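The sketch can be written out for the PC family as a worked comparison of a candidate $k$ against the true $K$. This is a paraphrase under the two rate conditions on $g(N,T)$, with $c$ a generic positive constant introduced here for illustration, not a reproduction of the paper's proof.

$$ \text{PC}(k) - \text{PC}(K) \;=\; \bigl[V(k) - V(K)\bigr] \;+\; (k - K)\, g(N,T). $$

Under-fitting, $k < K$: the first term stays above some $c > 0$ with probability approaching one, while the second term is negative but vanishes because $g(N,T) \to 0$. Hence

$$ \text{PC}(k) - \text{PC}(K) \;\geq\; c - (K - k)\, g(N,T) \;>\; 0 \quad \text{for large } N, T. $$

Over-fitting, $k > K$: the first term is non-positive but only of order $C_{NT}^{-2}$, so after scaling by $C_{NT}^2$

$$ C_{NT}^2 \bigl[\text{PC}(k) - \text{PC}(K)\bigr] \;=\; O_p(1) \;+\; (k - K)\, C_{NT}^2\, g(N,T) \;\to\; +\infty, $$

because $C_{NT}^2\, g(N,T) \to \infty$. In both cases $\text{PC}(K)$ is the smaller value with probability approaching one, which is the consistency claim.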

5. Empirical findings

Applied to a Stock-Watson US macro panel (215 series, ~39 years monthly), the criteria consistently select 2 factors. Robust to sub-sample stability, different transformations (levels vs. growth), adding/dropping series. Monte Carlo: reliable when $\min(N,T) \geq 40$; unstable below — important caveat for short panels.

6. Connection to other papers in this series

flowchart TB
  BN["Bai & Ng (2002)<br/>Penalised IC for K"]
  B03["Bai (2003)<br/>Asymptotic distributions"]
  ON["Onatski (2010)<br/>Eigenvalue-ratio<br/>(robust to weak factors)"]
  KPS["Kelly-Pruitt-Su (2019)<br/>IPCA uses BN for K"]
  CRW["CRW (2023)<br/>K̂ uses ratio selector<br/>(works for fixed T)"]
  BN --> B03
  BN --> ON
  BN --> KPS
  ON --> CRW
  style BN fill:#fff6e3,stroke:#b8651e

Bai (2003) — inference layer on top of BN's selection. Once you've picked $K$, this tells you the asymptotic distribution of $\hat F$ and $\hat\lambda$.
Onatski (2010) — BN criteria can over-select when factors are weak; eigenvalue-ratio test is more robust in those regimes.
CRW (2023) — eigenvalue-ratio selector that holds for fixed $T$ (BN needs $T \to \infty$). Fixed-$T$ result enables rolling sub-sample analysis.

7. What this gives the AOF model

For the AOF replication / extension stack, Bai-Ng IC is the default $K$-selector when both panel dimensions are large:

Scenario	Recommended selector	Reason
US full panel 1968→today	BN $\text{IC}_{p2}$	Standard, well-tested, consistent under large $N$, $T$.
US rolling 5-year window	CRW eigenvalue ratio	$T \approx 60$ — too small for BN to be reliable.
JP full panel 1990→today	BN $\text{IC}_{p2}$, $k_{\max} = 8$	$T \approx 400$, $N \approx 3{,}500$ — well within BN's comfort zone.
JP rolling 5-year window	CRW eigenvalue ratio	Same small-$T$ concern.

Implementation: 30 lines of NumPy. Compute SVD once with $k_{\max}$ components, get $V(k)$ for all $k \leq k_{\max}$ from cumulative explained variance, minimise the criterion.
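A minimal sketch of that recipe, under stated assumptions: the function name `bai_ng_select` is hypothetical, each series is standardised before the SVD (a common convention, not something this note prescribes), $k_{\max} = 8$ is borrowed from the JP row above, and the test panel is simulated with three strong factors rather than drawn from real data.

```python
import numpy as np

def bai_ng_select(X, k_max=8):
    """Return (K_hat from IC_p2, K_hat from PC_p2) for a T x N panel X."""
    T, N = X.shape
    # Assumption: standardise each series so no single column dominates the SVD.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    s = np.linalg.svd(X, compute_uv=False)
    explained = np.concatenate(([0.0], np.cumsum(s[:k_max] ** 2)))
    V = (np.sum(X ** 2) - explained) / (N * T)    # V(0), ..., V(k_max)
    k = np.arange(k_max + 1)
    g2 = (N + T) / (N * T) * np.log(min(N, T))    # p2 penalty, C_NT^2 = min(N, T)
    ic_p2 = np.log(V) + k * g2
    pc_p2 = V + k * V[k_max] * g2                 # sigma_hat^2 = V(k_max)
    return int(np.argmin(ic_p2)), int(np.argmin(pc_p2))

# Simulated panel: K = 3 factors with N(0, 1) loadings plus unit-variance i.i.d. noise.
rng = np.random.default_rng(42)
T, N, K = 200, 300, 3
F = rng.standard_normal((T, K))
Lam = rng.standard_normal((N, K))
X = F @ Lam.T + rng.standard_normal((T, N))

k_ic, k_pc = bai_ng_select(X)
print("IC_p2 picks", k_ic, "| PC_p2 picks", k_pc)
```

With factors this strong both criteria should recover the simulated factor count. On real return panels the idiosyncratic correlation is heavier than in this simulation, so it is worth checking how sensitive the answer is to $k_{\max}$ and to the choice among the six variants.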

8. Reading next

Bai (2003) — inference layer; once you've picked $K$, this tells you how confident to be about $\hat F$ and $\hat\lambda$.
Onatski (2010) — what to use when BN over-selects.

