ACG IBM Planning and Analytics Blog

From Unstructured Data to Actionable Insights using IBM watsonx.data

Written by Venkata Bachu | Thu, Nov 13, 2025 @ 08:30 PM
 

Introduction 

In today’s data-driven world, organizations often sit on massive volumes of unstructured information: PDFs, logs, text files, and JSON documents. While these files hold valuable information, business analytics tools like IBM Planning Analytics (PA) rely on structured, queryable data to deliver insights. That's where IBM watsonx.data and IBM PA come together. In this article, we will explore how unstructured data can be processed in watsonx.data and made available in IBM PA for planning and analysis. 
 
This post walks through an end-to-end workflow that converts unstructured data into structured tables in watsonx.data and connects them seamlessly to IBM PA for advanced planning and forecasting. 
 
 

Understanding the Architecture 

  • IBM watsonx.data is a modern, open data lakehouse that can store and query structured, semi-structured, and unstructured data. 
  • IBM Planning Analytics, on the other hand, can only consume structured data (tables with rows and columns).
  • To connect the two, unstructured data needs to be converted into structured tables that IBM PA can read. This connection is established via watsonx.data’s Presto/Trino query engine, or through DataStage or watsonx.ai pipelines.
     

Step 1: Ingest Unstructured Data into IBM watsonx.data 

Start by uploading your raw files (PDFs, JSON documents, logs, etc.) to IBM Cloud Object Storage (COS). watsonx.data integrates directly with COS, allowing you to register a COS bucket as a catalog or schema. 
 
Example SQL: 
         CREATE SCHEMA IF NOT EXISTS unstructured_data 
         WITH (location = 's3a://my-bucket/unstructured/'); 
 
You can also use ETL tools like DataStage or watsonx.ai pipelines to create ingestion jobs that extract metadata or text and store the results in structured formats such as Parquet or CSV. 
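As an illustrative sketch of such an ingestion job (the paths and field names below are placeholders, not part of any IBM API), a small Python step could flatten newline-delimited JSON log records into a CSV file that is easy to register later:

```python
import csv
import json

def json_logs_to_csv(json_lines, csv_path):
    """Flatten newline-delimited JSON log records into a flat CSV file.

    json_lines: iterable of JSON strings, one record per line.
    Only the fields listed below are kept; everything else is dropped.
    """
    fields = ["timestamp", "level", "message"]  # placeholder schema
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for line in json_lines:
            record = json.loads(line)
            # Keep only the declared fields; missing keys become empty strings
            writer.writerow({k: record.get(k, "") for k in fields})

# Example: two raw log lines become two structured rows
logs = [
    '{"timestamp": "2025-11-13T08:30:00Z", "level": "INFO", "message": "job started", "host": "a"}',
    '{"timestamp": "2025-11-13T08:31:00Z", "level": "ERROR", "message": "job failed"}',
]
json_logs_to_csv(logs, "logs.csv")
```

The resulting CSV (or, at scale, Parquet) is exactly the kind of structured artifact you would land in COS for the next step.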
 
 

Step 2: Process and Structure the Data 

Once your files are in COS, you’ll need to make them analyzable by transforming them into tabular form. 

Options include:

  • watsonx.ai: Use LLMs or NLP models to extract structured fields such as categories, summaries, or sentiment.
  • IBM DataStage: Build data flow pipelines to parse and transform JSON, XML, or text files.
  • Python + watsonx.data: Use Python libraries (pandas, PySpark, etc.) to extract and process the data. 
Example output: 
        | document_id | category | summary | sentiment | 
 
Save this structured data back into watsonx.data for querying.  
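A minimal stand-in for this extraction step is sketched below; the keyword-based "sentiment" and fixed "category" are placeholders for a real watsonx.ai or NLP model call, used only to show the output shape:

```python
def structure_document(document_id, text):
    """Turn one raw document into a structured row.

    In a real pipeline the category, summary, and sentiment would come
    from an NLP model; these heuristics only illustrate the schema.
    """
    negative_words = {"fail", "failed", "error", "loss"}
    words = text.lower().split()
    sentiment = "negative" if negative_words & set(words) else "positive"
    return {
        "document_id": document_id,
        "category": "report",             # placeholder classifier output
        "summary": " ".join(words[:10]),  # first 10 words as a stub summary
        "sentiment": sentiment,
    }

row = structure_document("doc-001", "Quarterly revenue grew despite one supply error")
```

Each row produced this way maps directly onto the example columns above.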
 
 

Step 3: Create Structured Tables for IBM PA 

After processing, register your structured files (CSV, Parquet, or Delta) as tables in watsonx.data: 

Example SQL: 
       CREATE TABLE processed_data ( 
              column1 VARCHAR, 
              column2 VARCHAR, 
              column3 VARCHAR 
           ) 
           WITH ( 
                 format = 'PARQUET', 
                 external_location = 's3a://my-bucket/processed/' 
           ); 
 
Confirm you can query the table in watsonx.data’s SQL editor. 
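If you register many such tables, generating the DDL programmatically keeps column lists and locations consistent. A small helper (table and column names here are illustrative) might look like:

```python
def create_table_ddl(table, columns, location, fmt="PARQUET"):
    """Build a Presto/Trino CREATE TABLE statement for an external table.

    columns: list of (name, type) pairs; location: the COS/S3 path.
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"WITH (\n  format = '{fmt}',\n  external_location = '{location}'\n);"
    )

ddl = create_table_ddl(
    "processed_data",
    [("column1", "VARCHAR"), ("column2", "VARCHAR"), ("column3", "VARCHAR")],
    "s3a://my-bucket/processed/",
)
```

The generated statement matches the hand-written DDL above and can be submitted through watsonx.data's SQL editor or a Trino client.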
 
 

Step 4: Connect IBM Planning Analytics to watsonx.data 

Now that you have structured, queryable tables, connect IBM PA to watsonx.data in one of two ways:

  • Directly via ODBC/JDBC to the Presto/Trino endpoint.
  • Through IBM DataStage or Watson Query as an intermediary layer. 
 
Example JDBC connection: 
         jdbc:trino://<watsonx-data-endpoint>:443/catalog/schema 
 
Once the connection is set, import your structured tables into IBM PA cubes for use in forecasting, reporting, or scenario analysis. 
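The connection string can be assembled from its parts; the host, catalog, and schema below are placeholder values you would replace with your own watsonx.data details:

```python
def trino_jdbc_url(host, catalog, schema, port=443):
    """Assemble a Trino JDBC URL of the form jdbc:trino://host:port/catalog/schema."""
    return f"jdbc:trino://{host}:{port}/{catalog}/{schema}"

url = trino_jdbc_url("example.watsonx-data.ibm.com", "hive_data", "unstructured_data")
```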
 
 

Example: End-to-End Workflow from PDFs to PA

1. Upload PDFs to COS 
   s3a://my-bucket/raw_pdfs/ 
 
2. Extract Text using Python and watsonx.data SDK 
   from PyPDF2 import PdfReader 
   import pandas as pd 
 
   def extract_pdf_text(file): 
       # Read every page; skip pages that yield no extractable text 
       reader = PdfReader(file) 
       pages = (page.extract_text() for page in reader.pages) 
       return " ".join(text for text in pages if text) 
 
   data = [{"filename": f, "text": extract_pdf_text(f)} for f in ["File1.pdf", "File2.pdf"]] 
   pd.DataFrame(data).to_csv("pdf_texts.csv", index=False)
 
3. Upload Structured Text to watsonx.data 
   CREATE TABLE pdf_raw_texts ( 
     filename VARCHAR, 
     text VARCHAR 
   ) 
   WITH ( 
     format = 'CSV', 
     external_location = 's3a://my-bucket/structured/' 
   );
 

4. Optional: Enrich Data with NLP 
   from ibm_watsonx_ai.foundation_models import Model 
   import pandas as pd 
 
   # url, apikey, and project_id are placeholders for your watsonx.ai credentials 
   model = Model( 
       model_id="ibm/granite-13b-instruct", 
       credentials={"url": "https://us-south.ml.cloud.ibm.com", "apikey": "your_api_key"}, 
       project_id="your_project_id", 
   ) 
 
   df = pd.read_csv("pdf_texts.csv") 
   df["summary"] = df["text"].apply(lambda t: model.generate_text(prompt=f"Summarize this:\n{t[:5000]}")) 
   df[["filename", "summary"]].to_csv("pdf_summaries.csv", index=False)

 
5. Register and Query in watsonx.data 
   CREATE TABLE pdf_summaries ( 
     filename VARCHAR, 
     summary VARCHAR 
   ) 
   WITH ( 
     format = 'CSV', 
     external_location = 's3a://my-bucket/structured/pdf_summaries/' 
   );
 
6. Import into IBM PA 
   - In IBM PA Workspace, go to Data Sources → Add Connection 
   - Choose JDBC/ODBC 
   - Enter watsonx.data connection info 
   - Import pdf_summaries into your cubes for analysis.
 
 

Conclusion 

By combining IBM watsonx.data’s scalable data lakehouse with IBM PA’s modeling and forecasting capabilities, you can unlock insights from even your most complex unstructured datasets. 
 
This integration enables a full data lifecycle, from raw files to enriched insights, all within the IBM ecosystem.