Preprocessing Stock Price Time-Series for Neural Networks
What Format is the Raw Data?
Normally, stock price data looks like this:
Date | Open | High | Low | Close | Volume |
---|---|---|---|---|---|
2025-01-01 | 102.5 | 105.0 | 101.0 | 104.0 | 5000000 |
2025-01-02 | 104.2 | 106.5 | 103.5 | 105.5 | 4800000 |
Critical point: Time order must be preserved. You cannot shuffle the rows.
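As a minimal sketch of loading such a table while keeping it in time order, the snippet below uses pandas. The inline CSV text and the variable names are assumptions made for illustration; the rows are deliberately listed out of order to show the sort.

```python
import io
import pandas as pd

# Hypothetical CSV content (illustrative only); rows are intentionally out of order.
raw_csv = """Date,Open,High,Low,Close,Volume
2025-01-02,104.2,106.5,103.5,105.5,4800000
2025-01-01,102.5,105.0,101.0,104.0,5000000
"""

# Parse the Date column as datetimes and use it as the index.
df = pd.read_csv(io.StringIO(raw_csv), parse_dates=["Date"], index_col="Date")

# Restore chronological order before any further step.
df = df.sort_index()

print(df.index.is_monotonic_increasing)  # True
print(df["Close"].tolist())              # [104.0, 105.5]
```

Sorting once at load time, and never shuffling afterwards, keeps every later step (filling, scaling, windowing) aligned with the calendar.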
What Does It Mean to "Normalize Prices"?
Normalization means scaling the numeric values into a smaller, consistent range.
Neural networks train better when all inputs share a similar scale. Otherwise large-valued features such as Volume (millions) dominate small-valued ones such as prices (around a hundred).
Benefits:
- Prevents numerical instability.
- Ensures faster, smoother training.
- Keeps gradients stable during backpropagation.
MinMaxScaler vs StandardScaler
MinMaxScaler
- Scales data to a fixed range, usually [0, 1].
- Formula: x' = (x - min(x)) / (max(x) - min(x))
- Example: If Open prices range from 90 to 110, then 90 maps to 0, 110 maps to 1, and 100 maps to 0.5.
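The worked example above can be checked with a few lines of plain Python. This is only an illustration of the formula, not the scikit-learn implementation; the function name `min_max_scale` and the choice to map a constant column to 0.0 are assumptions.

```python
def min_max_scale(values):
    """Apply x' = (x - min) / (max - min) to a list of numbers."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([90, 100, 110]))  # [0.0, 0.5, 1.0]
```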
StandardScaler
- Scales data to have zero mean and unit variance. It shifts and rescales the values but does not change the shape of their distribution, so the result is not necessarily Gaussian.
- Formula:
x' = (x - mean) / standard_deviation
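- Useful when what matters is how far a value sits from its average level, measured in standard deviations, rather than where it falls between a fixed minimum and maximum.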
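To make the standardization formula concrete, here is a small sketch in plain Python. The function name `standardize` is hypothetical. It uses the population standard deviation (dividing by n), which is also what scikit-learn's StandardScaler uses.

```python
import statistics

def standardize(values):
    """Apply x' = (x - mean) / standard_deviation to a list of numbers."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (divides by n)
    if std == 0:
        # Assumption: map a constant column to 0.0 instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

z = standardize([90, 100, 110])
print([round(v, 3) for v in z])  # [-1.225, 0.0, 1.225]
```

The scaled values have mean 0 and standard deviation 1, but they are still evenly spaced like the inputs: the shape of the distribution is unchanged.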
How Should a System Admin Preprocess the Data?
- Load the data from CSV, database, or API.
- Check for missing dates or fields.
- Handle missing data:
  - Gaps on weekends and market holidays are normal (no trading) and need no filling.
  - Missing values in features (Open, High, etc.) must be filled or interpolated.
- Normalize the features:
  - Apply MinMaxScaler or StandardScaler to each column separately.
  - Save the scaler for inverse transformation later.
- Frame into sequences:
  - Use the past 60 days as input to predict the next day's close price.
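The "check for missing dates" step can be sketched with pandas as follows. The dates and prices are made up for illustration. `pd.bdate_range` only knows about Monday to Friday, not exchange holidays, so any date it reports still has to be compared against the exchange calendar by hand.

```python
import pandas as pd

# Hypothetical trading dates; 2025-01-03 (a Friday) is absent.
dates = pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-06"])
df = pd.DataFrame({"Close": [104.0, 105.5, 106.0]}, index=dates)

# All weekdays between the first and last date in the data.
expected = pd.bdate_range(df.index.min(), df.index.max())

# Weekdays that have no row: either a market holiday or a real data gap.
missing = expected.difference(df.index)

print([d.strftime("%Y-%m-%d") for d in missing])  # ['2025-01-03']
```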
How to Frame the Data into Sequences
Suppose your dataset is 100 days long:
- First input (X) = days 1-60; Target (y) = day 61 Close price.
- Second input (X) = days 2-61; Target (y) = day 62 Close price.
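In Python, assuming data is a pandas DataFrame sorted by date:
X, y = [], []
for i in range(60, len(data)):
    X.append(data.iloc[i-60:i].to_numpy())  # 60-day slice (rows i-60 to i-1)
    y.append(data['Close'].iloc[i])         # Target is next day's Close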
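A self-contained version of the same idea, using NumPy arrays, makes the resulting shapes easy to check. The function name `make_windows`, the synthetic data, and the choice of column 3 as "Close" are assumptions for illustration only.

```python
import numpy as np

def make_windows(features, target, window=60):
    """Build (samples, window, n_features) inputs and next-day targets."""
    X, y = [], []
    for i in range(window, len(features)):
        X.append(features[i - window:i])  # rows i-window .. i-1
        y.append(target[i])               # value on the day right after the window
    return np.asarray(X), np.asarray(y)

# Synthetic stand-in for 100 days of 5 features (Open, High, Low, Close, Volume).
n_days, n_features = 100, 5
features = np.arange(n_days * n_features, dtype=float).reshape(n_days, n_features)
close = features[:, 3]  # assumption: column 3 plays the role of Close

X, y = make_windows(features, close)
print(X.shape, y.shape)  # (40, 60, 5) (40,)
```

With 100 days and a 60-day window there are 40 samples, and the last row of each window is always the day immediately before its target.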
How to Handle Missing Data
- Single missing values: Use forward-fill:
df.fillna(method='ffill', inplace=True)
df.interpolate(method='linear', inplace=True)
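The difference between the two methods is easiest to see on a tiny series. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Close prices with one missing day in the middle.
close = pd.Series([100.0, np.nan, 104.0])

# Forward-fill repeats the last known value.
print(close.ffill().tolist())                       # [100.0, 100.0, 104.0]

# Linear interpolation places the gap halfway between its neighbors.
print(close.interpolate(method="linear").tolist())  # [100.0, 102.0, 104.0]
```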
Quick Summary Table
Task | Method |
---|---|
Normalize prices | MinMaxScaler [0,1] or StandardScaler (mean=0, std=1) |
Data format | Tabular with Date, Open, High, Low, Close, Volume |
Frame sequences | Past 60 days input to predict next day's Close |
Handle missing data | Forward-fill or interpolate; preserve time order |
Important Warning
You must apply the same normalization parameters (scaler) to both training and test data.
Fit the scaler on the training set only, then apply it unchanged to unseen (validation/test) data.
Otherwise, statistics from the future (min, max, mean) leak into training, and your test results look better than they really are.
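A minimal sketch of this rule with scikit-learn's MinMaxScaler is shown below. The four prices and the 3/1 chronological split are assumptions chosen to keep the arithmetic visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical Close prices in time order; scikit-learn expects a 2-D array.
close = np.array([[90.0], [100.0], [110.0], [120.0]])

# Chronological split: earlier rows for training, later rows for testing.
train, test = close[:3], close[3:]

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=90, max=110 from train only
test_scaled = scaler.transform(test)        # reuses those parameters, no refitting

print(train_scaled.ravel())
print(test_scaled.ravel())

# The saved scaler maps scaled values (e.g. model predictions) back to prices.
print(scaler.inverse_transform(test_scaled).ravel())
```

The test value 120 lies above the training maximum, so it scales to about 1.5, outside [0, 1]. That is expected: it shows the scaler has not seen the test data.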