2025-10-04

Reconstructing the KRX Order Book for Historical Back-Testing

By Woojae Jeon · Topics: order book, KRX data, tick data, slippage modeling


Public KRX (한국거래소) data, as available through the KRX Data Service and most third-party data vendors, delivers daily OHLCV — open, high, low, close, and volume. What it does not deliver is intraday order book depth: the bid-ask spread at order time, the available depth at each price level, or the sequence of fills that constitute a day's volume. For a back-testing framework that wants to model realistic execution rather than mid-price fantasy fills, this gap is the central engineering challenge.

This post describes the methodology Finology uses to reconstruct approximate bid-ask surfaces from available KRX historical data, the assumptions that reconstruction requires, and where the approximation breaks down for specific market conditions or liquidity tiers.

Note on methodology limits: The reconstruction approach described here is an approximation. We are not claiming it replicates full Level 2 order book data — we are saying it produces estimates that are meaningfully more accurate than mid-price fills for the liquidity conditions typical of KOSPI 200 rotation strategies.

What KRX Historical Data Actually Provides

KRX Data Service provides several data tiers beyond basic OHLCV. The most relevant for order book reconstruction are:

Tick-level trade data (체결 데이터): Individual trade records with timestamp, price, and volume. Available for a rolling historical window through the KRX Data Service API; longer archives require institutional data subscriptions.
Best bid/ask snapshots: Periodic snapshots of the best bid price, best ask price, and available depth at those levels, typically captured at 1-second or 5-second intervals in the current data feed. Historical tick-level snapshots beyond a rolling 2-year window are not available in standard KRX data products.
End-of-day trading statistics: Volume-weighted average price (VWAP, 거래량 가중평균가격), number of trades, and tick range, with history starting between 2002 and 2005 for most names.

The reconstruction problem is: given daily OHLCV plus tick-level trade data where available, and end-of-day statistics for older history, how do we estimate the intraday bid-ask spread that a strategy would have faced at order time?

Roll's Spread Estimator and KRX Tick Data

The foundational methodology for bid-ask spread estimation from trade-level data is Roll's spread estimator (Roll 1984), which uses the serial covariance of transaction price changes to back out the effective spread. The intuition: if the underlying price follows a random walk and trades hit the bid or the ask with equal probability, the bid-ask bounce makes the first-order autocovariance of price changes negative and equal to minus the squared half-spread. The full spread is therefore twice the square root of the negated autocovariance.

import pandas as pd
import numpy as np

def roll_spread(prices: pd.Series) -> float:
    """
    Roll (1984) effective spread estimator.
    Input: series of trade prices (chronological).
    Returns: estimated full bid-ask spread in price units.
    """
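    price_changes = prices.diff().dropna()
    # Series.cov aligns on the index and skips the NaN created by shift(1),
    # so this is the first-order autocovariance of price changes.
    cov = price_changes.cov(price_changes.shift(1))
    if np.isnan(cov) or cov >= 0:
        # Too few trades or non-negative covariance: spread estimate undefined, use 0
        return 0.0
    return 2 * float(np.sqrt(-cov))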

# Example: estimate spread from one day's tick data for a single KOSPI name
# tick_prices = pd.Series([...])  # chronological trade prices from KRX tick feed
# estimated_spread = roll_spread(tick_prices)
# print(f"Roll estimated spread: {estimated_spread:.2f} KRW")

Roll's estimator works reasonably well for liquid KOSPI 200 large-caps where the price bounce pattern is dominated by spread crossing rather than order flow imbalance. It is less reliable for mid-caps where large directional trades dominate the tick record, because the negative autocovariance assumption breaks down when prices trend intraday.
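The contrast between the two regimes can be checked on synthetic data. The sketch below is illustrative only: the 100 KRW spread, the 70,000 KRW mid price, the AR(1) persistence of 0.5, and the helper name roll_spread_estimate are all assumptions made for the example, not KRX figures or Finology code. It simulates one tick series where price changes are pure bid-ask bounce and one where the mid price has positively autocorrelated increments (a stand-in for directional order flow).

```python
import numpy as np
import pandas as pd

def roll_spread_estimate(prices: pd.Series) -> float:
    """Roll (1984): 2 * sqrt(-autocov) of price changes, or 0.0 if undefined."""
    changes = prices.diff().dropna()
    cov = changes.cov(changes.shift(1))
    if np.isnan(cov) or cov >= 0:
        return 0.0
    return 2.0 * float(np.sqrt(-cov))

rng = np.random.default_rng(0)
n_trades = 5_000
true_spread = 100.0   # KRW, hypothetical
mid_price = 70_000.0  # KRW, hypothetical

# Each trade hits the bid (-1) or the ask (+1) with equal probability.
side = rng.choice([-1.0, 1.0], size=n_trades)
bounce = side * true_spread / 2

# Case 1: constant mid price, so every price change is pure bid-ask bounce.
est_bounce = roll_spread_estimate(pd.Series(mid_price + bounce))

# Case 2: mid-price increments follow an AR(1) with positive persistence,
# so consecutive price changes are positively correlated (prices trend).
shocks = rng.normal(0.0, 80.0, size=n_trades)
increments = np.zeros(n_trades)
for t in range(1, n_trades):
    increments[t] = 0.5 * increments[t - 1] + shocks[t]
trending_mid = mid_price + np.cumsum(increments)
est_trend = roll_spread_estimate(pd.Series(trending_mid + bounce))

print(f"bounce only: {est_bounce:.1f} KRW, trending: {est_trend:.1f} KRW")
```

Under these assumptions the bounce-only estimate should land close to the true 100 KRW, while the trending series pushes the autocovariance positive and the estimator falls back to 0. A steady linear drift alone would not have this effect, because adding a constant to every price change leaves the covariance unchanged; it is the persistence in price changes that breaks the estimator.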

The Corwin-Schultz High-Low Estimator

For historical periods where only daily OHLCV is available (before 2018 for most names, when tick-level coverage begins), Finology uses the Corwin-Schultz (2012) spread estimator, which derives an estimate from the daily high-low range:

def corwin_schultz_spread(high: pd.Series, low: pd.Series) -> pd.Series:
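    """
    Corwin-Schultz (2012) spread estimator using daily H-L ranges.
    Returns series of estimated daily spreads as a fraction of price
    (not KRW). The first row is NaN because two days are required.
    """
    # Squared log range of each single day, summed over days t-1 and t
    beta = (np.log(high / low)) ** 2
    beta_sum = beta + beta.shift(1)
    # Squared log range of the combined two-day window
    gamma = (np.log(high.rolling(2).max() / low.rolling(2).min())) ** 2
    k = 3 - 2 * np.sqrt(2)
    alpha = (np.sqrt(2 * beta_sum) - np.sqrt(beta_sum)) / k - np.sqrt(gamma / k)
    spread = 2 * (np.exp(alpha) - 1) / (1 + np.exp(alpha))
    # Negative estimates are not meaningful as spreads, so floor them at 0
    return spread.clip(lower=0)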

The Corwin-Schultz estimator requires at least two consecutive days of OHLCV and is calibrated for markets where the high-low range is a reasonable proxy for order flow impact. For KOSPI, the estimator shows consistent performance for large-caps but tends to understate effective spreads for KOSDAQ names where intraday gaps driven by thin order books inflate the H-L range beyond pure spread effects.
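For reference, the quantities computed by the function above can be written out as follows. The indexing here is backward-looking (days t-1 and t) to match the code rather than the paper's forward-looking convention, and S is a proportional spread, not a KRW amount.

```latex
\begin{aligned}
\beta_t  &= \left[\ln\frac{H_t}{L_t}\right]^2 + \left[\ln\frac{H_{t-1}}{L_{t-1}}\right]^2 \\
\gamma_t &= \left[\ln\frac{\max(H_t, H_{t-1})}{\min(L_t, L_{t-1})}\right]^2 \\
\alpha_t &= \frac{\sqrt{2\beta_t} - \sqrt{\beta_t}}{3 - 2\sqrt{2}} - \sqrt{\frac{\gamma_t}{3 - 2\sqrt{2}}} \\
S_t      &= \frac{2\left(e^{\alpha_t} - 1\right)}{1 + e^{\alpha_t}}
\end{aligned}
```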

Depth Estimation: Beyond Spread

Spread estimation gives us the cost of crossing the bid-ask once. For a rotation strategy placing orders of non-trivial size relative to daily volume, the relevant cost is market impact — how much the price moves as order size exceeds the available depth at the best quotes.

Depth estimation from daily data relies on the relationship between trade size and intraday price impact. The approach Finology uses is a simplified Amihud (2002) illiquidity ratio applied at the daily level, calibrated against available intraday depth snapshots for the period where snapshot data exists, then projected backward for older history using the Amihud coefficient as an anchor:

def amihud_illiquidity(returns: pd.Series, volume_krw: pd.Series,
                        window: int = 20) -> pd.Series:
    """
    Amihud (2002) illiquidity ratio.
    Returns rolling measure of price impact per KRW of volume.
    Higher value = less liquid (more price impact per unit of flow).
    """
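    # Zero-volume days (for example trading halts) would divide by zero; treat them as missing
    daily_ratio = returns.abs() / volume_krw.replace(0, np.nan)
    return daily_ratio.rolling(window).mean()

# For order size impact estimation (assumes impact scales linearly with order size):
# estimated_impact_bps = amihud_illiquidity_value * order_size_krw * 10000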
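A small worked example makes the units concrete. The numbers below are made up for illustration (a name moving 1% a day on 50 billion KRW of turnover, and a 500 million KRW order), and impact_bps is a hypothetical helper that simply applies the linear scaling from the comment above.

```python
import pandas as pd

def impact_bps(illiquidity: float, order_size_krw: float) -> float:
    """Linear impact estimate in basis points: illiquidity * order size * 10,000."""
    return illiquidity * order_size_krw * 1e4

# Hypothetical inputs: |return| of 1% per day on 50 billion KRW traded value.
returns = pd.Series([0.010, -0.010, 0.010, -0.010])
volume_krw = pd.Series([5e10] * 4)

# Same ratio as the function above, with a 4-day window to fit the toy data.
illiq = (returns.abs() / volume_krw).rolling(4).mean().iloc[-1]

order_size_krw = 5e8  # 500 million KRW, i.e. 1% of daily traded value
impact = impact_bps(illiq, order_size_krw)
print(f"illiquidity: {illiq:.2e} per KRW, estimated impact: {impact:.2f} bps")
```

Here the ratio is 0.01 / 5e10 = 2e-13 per KRW, so an order of 5e8 KRW maps to a return impact of 1e-4, or about 1 basis point. The linear extrapolation is the weakest part of this estimate for orders that are large relative to daily volume.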

Where the Reconstruction Breaks Down

No reconstruction from daily data is a substitute for actual order book data, and there are specific conditions where the approximation materially understates true execution costs.

Rebalance-day crowding: KOSPI 200 periodic reconstitution and MSCI quarterly index reviews create concentrated institutional order flow on specific dates. On those dates, the effective spread for affected names is 2–5x wider than normal-day estimates because competing orders deplete available depth at the best quotes. A model calibrated on normal-day Corwin-Schultz estimates will understate rebalance-day costs by a factor of 2 or more for the names most affected by institutional index rebalancing.
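One simple way to account for this in a back-test is to scale the normal-day estimate on known rebalance dates. The sketch below is an assumption about how such an adjustment could look, not a description of Finology's pipeline: the function name, the example dates, and the default multiplier of 2.0 (the lower end of the range quoted above) are all hypothetical.

```python
import pandas as pd

def widen_on_rebalance_days(spread: pd.Series, rebalance_dates,
                            multiplier: float = 2.0) -> pd.Series:
    """Scale normal-day spread estimates on the given rebalance dates.

    spread must be indexed by date (DatetimeIndex).
    """
    is_rebalance = spread.index.isin(pd.to_datetime(list(rebalance_dates)))
    # Keep the original value on normal days, widen it on rebalance days.
    return spread.where(~is_rebalance, spread * multiplier)

# Hypothetical dates and spreads, for illustration only.
dates = pd.date_range("2024-06-12", periods=3, freq="D")
normal_spread = pd.Series([10.0, 10.0, 10.0], index=dates)
adjusted = widen_on_rebalance_days(normal_spread, ["2024-06-13"])
print(adjusted)
```

A flat multiplier is crude; a per-name multiplier calibrated on past rebalance days would be the natural refinement where snapshot data exists.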

Gap openings after corporate events: Earnings announcements (실적 발표), analyst target price revisions, and regulatory news often cause gap openings where the first trade of the day is materially above or below the previous close. On gap-open days, the H-L range overstates the tradeable spread because the gap is price discovery, not liquidity cost. Corwin-Schultz applied naively to gap-open days produces artificially large spread estimates that do not correspond to actual execution friction.
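A minimal guard is to flag gap-open days and drop their high-low estimates, so they can be filled from neighbouring days instead. The sketch below assumes a 3% open-versus-previous-close threshold, which is an arbitrary illustrative value, and mask_gap_open_days is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def mask_gap_open_days(spread: pd.Series, open_: pd.Series, close: pd.Series,
                       threshold: float = 0.03):
    """Blank out spread estimates on days that open far from the previous close.

    Returns (masked_spread, is_gap). The first day has no previous close
    and is never flagged.
    """
    gap = (open_ / close.shift(1) - 1).abs()
    is_gap = gap > threshold  # NaN compares as False
    return spread.mask(is_gap), is_gap

# Toy data: day 2 opens 5% above the previous close.
open_ = pd.Series([100.0, 105.0, 105.5])
close = pd.Series([100.0, 105.0, 106.0])
spread = pd.Series([0.002, 0.020, 0.002])

masked, is_gap = mask_gap_open_days(spread, open_, close)
print(masked)
```

The masked values can then be forward-filled or replaced with a rolling median, depending on how conservative the fill model should be.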

Pre-2005 data: KRX data quality and consistency before 2005 are meaningfully lower than in the post-2005 period. Corporate action adjustments are less complete, trading halts are more frequent and less systematically flagged, and the Corwin-Schultz estimator calibration is less reliable. Finology's standard back-test window begins in 2005; extending back to 2000 or earlier is possible but is disclosed as lower-confidence on spread estimation.

Practical Output: How Reconstruction Feeds Into the Back-Test

The reconstruction pipeline produces a daily spread estimate and depth estimate for each name in the back-test universe. These feed into the fill model as follows: the simulated fill price for a buy order is the estimated ask price (close + half-spread); for larger orders, an additional market impact adjustment is applied based on order size relative to estimated available depth.
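The fill rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the production fill model: the function name is hypothetical, the spread is assumed to be in KRW (a proportional Corwin-Schultz estimate would first be multiplied by the close), the sell side is assumed to mirror the buy side, and impact is passed in as basis points of the close.

```python
def simulated_fill_price(close: float, spread_krw: float, side: str,
                         impact_bps: float = 0.0) -> float:
    """Close plus (buy) or minus (sell) half the spread and any impact adjustment."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    sign = 1.0 if side == "buy" else -1.0
    half_spread = spread_krw / 2
    impact_krw = close * impact_bps / 1e4
    return close + sign * (half_spread + impact_krw)

# Hypothetical inputs: 70,000 KRW close, 100 KRW spread, 2 bps of depth impact.
small_buy = simulated_fill_price(70_000.0, 100.0, "buy")
large_buy = simulated_fill_price(70_000.0, 100.0, "buy", impact_bps=2.0)
large_sell = simulated_fill_price(70_000.0, 100.0, "sell", impact_bps=2.0)
print(small_buy, large_buy, large_sell)
```

With these inputs the small buy fills 50 KRW above the close, and the larger orders pay a further 14 KRW of impact on either side.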

The output of the slippage model is transparent in Finology's back-test reports — each rebalance includes a per-position slippage estimate showing the basis-point cost attributed to spread crossing and depth impact separately. This breakdown is diagnostic: if spread costs are consistently dominating for a particular strategy, it points toward calendar timing or universe selection adjustments. If depth impact costs are dominant, it points toward position sizing constraints.
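The diagnostic split can be illustrated with a toy per-position table. The column names, tickers, and numbers below are hypothetical and only show the shape of the calculation; they are not the layout of Finology's reports.

```python
import numpy as np
import pandas as pd

def slippage_breakdown(positions: pd.DataFrame) -> pd.DataFrame:
    """Split per-position slippage into spread and depth components, in bps.

    Expects columns: close, spread_krw, impact_bps.
    """
    out = positions.copy()
    # Cost of crossing half the spread, expressed in basis points of the close.
    out["spread_cost_bps"] = out["spread_krw"] / 2 / out["close"] * 1e4
    out["total_bps"] = out["spread_cost_bps"] + out["impact_bps"]
    out["dominant"] = np.where(out["spread_cost_bps"] >= out["impact_bps"],
                               "spread", "depth")
    return out

positions = pd.DataFrame(
    {"close": [50_000.0, 10_000.0],
     "spread_krw": [100.0, 10.0],
     "impact_bps": [4.0, 12.0]},
    index=["NAME_A", "NAME_B"],  # placeholder tickers
)
report = slippage_breakdown(positions)
print(report[["spread_cost_bps", "impact_bps", "total_bps", "dominant"]])
```

In this toy case the first position is spread-dominated and the second is depth-dominated, which under the reading above would point to timing changes for the former and position sizing limits for the latter.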

The full methodology for order book reconstruction, including the estimator choice rationale and calibration against KRX tick data, is documented on the Methodology page.