From Unstructured Data to Actionable Insights using IBM watsonx.data
Introduction
In today’s data-driven world, organizations often sit on massive volumes of unstructured information: PDFs, logs, text files, and JSON documents. While these files hold valuable information, business analytics tools like IBM Planning Analytics (PA) rely on structured, queryable data to deliver insights. That's where IBM watsonx.data and IBM PA come together. In this article, we explore how unstructured data can be processed in watsonx.data and made available in IBM PA for planning and analysis.
This post walks through an end-to-end workflow that converts unstructured data into structured tables in watsonx.data and connects them seamlessly to IBM PA for advanced planning and forecasting.
Understanding the Architecture
- IBM watsonx.data is a modern, open data lakehouse that can store and query structured, semi-structured, and unstructured data.
- IBM Planning Analytics, on the other hand, can only consume structured data (tables, columns, rows).
To connect the two, unstructured data needs to be converted into structured tables that IBM PA can read. This connection is established via watsonx.data’s Presto/Trino query engine or through DataStage or watsonx.ai pipelines.
Step 1: Ingest Unstructured Data into IBM watsonx.data
Start by uploading your raw files (PDFs, JSONs, logs, etc.) to IBM Cloud Object Storage (COS). watsonx.data integrates directly with COS, allowing you to register a COS bucket as a schema or catalog.
Example SQL:
CREATE SCHEMA IF NOT EXISTS unstructured_data
WITH (location = 's3a://my-bucket/unstructured/');
You can also use ETL tools like DataStage or watsonx.ai pipelines to create ingestion jobs that extract metadata or text and store the results in structured formats such as Parquet or CSV.
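As a minimal sketch of that approach, assuming pandas with s3fs and pyarrow installed (the log file names, bucket, credentials, and endpoint below are all placeholders):
import pandas as pd

# Flatten hypothetical raw log files into one row per line.
records = []
for path in ["app1.log", "app2.log"]:
    with open(path) as fh:
        records.extend({"source": path, "message": line.strip()} for line in fh)

# Write the structured result to COS as Parquet (pandas delegates to s3fs).
pd.DataFrame(records).to_parquet(
    "s3://my-bucket/unstructured/logs.parquet",
    index=False,
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY",
                     "client_kwargs": {"endpoint_url": "https://s3.us-south.cloud-object-storage.appdomain.cloud"}},
)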
Step 2: Process and Structure the Data
Once your files are in COS, you’ll need to make them analyzable by transforming them into tabular form.
Options include:
- watsonx.ai: Use LLMs or NLP models to extract structured fields like categories, summaries, or sentiment.
- IBM DataStage: Build data flow pipelines to parse and transform JSON, XML, or text files.
- Python SDK + watsonx.data: Use Python (pandas, PySpark, etc.) to extract and process the data.
Example output:
| document_id | category | summary | sentiment |
|-------------|----------|---------|-----------|
Save this structured data back into watsonx.data for querying.
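As a minimal sketch of the Python option (hypothetical rule-based enrichment standing in for a real NLP model; the input and output file names are placeholders):
import pandas as pd

df = pd.read_csv("pdf_texts.csv")  # output of the ingestion step

# Placeholder enrichment logic; in practice an NLP model would supply these fields.
df["category"] = df["text"].str.contains("invoice", case=False).map({True: "finance", False: "general"})
df["summary"] = df["text"].str.slice(0, 200)
df["sentiment"] = "neutral"
df.to_parquet("processed_docs.parquet", index=False)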
Step 3: Create Structured Tables for IBM PA
After processing, register your structured files (CSV, Parquet, or Delta) as tables in watsonx.data:
Example SQL:
CREATE TABLE processed_data (
column1 VARCHAR,
column2 VARCHAR,
column3 VARCHAR
)
WITH (
format = 'PARQUET',
external_location = 's3a://my-bucket/processed/'
);
Confirm you can query the table in watsonx.data’s SQL editor.
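For example, a quick sanity check:
SELECT column1, column2, column3
FROM processed_data
LIMIT 10;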
Step 4: Connect IBM Planning Analytics to watsonx.data
Now that you have structured, queryable tables, connect IBM PA to watsonx.data either:
- Directly via ODBC/JDBC to the Presto/Trino endpoint.
- Through IBM DataStage or Watson Query as an intermediary layer.
Example JDBC connection:
jdbc:trino://<watsonx-data-endpoint>:443/catalog/schema
Once the connection is set, import your structured tables into IBM PA cubes for use in forecasting, reporting, or scenario analysis.
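Before wiring up PA, you can verify the endpoint with the trino Python client (a sketch; the host, credentials, catalog, and schema are placeholders):
from trino.dbapi import connect
from trino.auth import BasicAuthentication

conn = connect(
    host="<watsonx-data-endpoint>", port=443,
    user="ibmlhapikey", auth=BasicAuthentication("ibmlhapikey", "your_api_key"),
    http_scheme="https", catalog="catalog", schema="schema",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM processed_data")
print(cur.fetchone())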
Example: End-to-End Workflow from PDFs to PA
1. Upload PDFs to COS
s3a://my-bucket/raw_pdfs/
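A minimal upload sketch using the ibm-cos-sdk (the endpoint, API key, and service instance ID are placeholders):
import ibm_boto3
from ibm_botocore.client import Config

# Create a COS client authenticated with an IAM API key.
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="your_api_key",
    ibm_service_instance_id="your_cos_instance_crn",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)
cos.upload_file("File1.pdf", "my-bucket", "raw_pdfs/File1.pdf")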
2. Extract Text using Python
from PyPDF2 import PdfReader
import pandas as pd

def extract_pdf_text(file):
    # Concatenate the extractable text from every page of the PDF.
    reader = PdfReader(file)
    pages = (page.extract_text() for page in reader.pages)
    return " ".join(text for text in pages if text)

data = [{"filename": f, "text": extract_pdf_text(f)} for f in ["File1.pdf", "File2.pdf"]]
pd.DataFrame(data).to_csv("pdf_texts.csv", index=False)
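Note: PyPDF2 is now maintained under the name pypdf; the same code works with from pypdf import PdfReader.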
3. Upload Structured Text to watsonx.data
CREATE TABLE pdf_raw_texts (
filename VARCHAR,
text VARCHAR
)
WITH (
format = 'CSV',
external_location = 's3a://my-bucket/structured/'
);
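Again, confirm the table is readable with a quick query:
SELECT filename, substr(text, 1, 120) AS preview
FROM pdf_raw_texts
LIMIT 5;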
4. Optional: Enrich Data with NLP
The snippet below assumes the ibm-watsonx-ai Python SDK; the endpoint URL, API key, project ID, and model ID are placeholders for your own values.
import pandas as pd
from ibm_watsonx_ai.foundation_models import Model

df = pd.read_csv("pdf_texts.csv")  # text extracted in the previous step
model = Model(model_id="ibm/granite-13b-instruct",  # substitute any model available in your instance
              credentials={"url": "https://us-south.ml.cloud.ibm.com", "apikey": "your_api_key"},
              project_id="your_project_id")
df["summary"] = df["text"].apply(lambda t: model.generate_text(prompt=f"Summarize this:\n{t[:5000]}"))
df[["filename", "summary"]].to_csv("pdf_summaries.csv", index=False)
5. Register and Query in watsonx.data
CREATE TABLE pdf_summaries (
filename VARCHAR,
summary VARCHAR
)
WITH (
format = 'CSV',
external_location = 's3a://my-bucket/structured/pdf_summaries/'
);
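If needed, the summaries can be joined back to the raw text for validation:
SELECT r.filename, s.summary
FROM pdf_raw_texts r
JOIN pdf_summaries s ON r.filename = s.filename;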
6. Import into IBM PA
- In IBM PA Workspace, go to Data Sources → Add Connection
- Choose JDBC/ODBC
- Enter watsonx.data connection info
- Import pdf_summaries into your cubes for analysis.
Conclusion
By combining IBM watsonx.data’s scalable data lakehouse with IBM PA’s modeling and forecasting capabilities, you can unlock insights from even your most complex unstructured datasets.
This integration enables a full data lifecycle, from raw files to enriched insights, all within the IBM ecosystem.