AI & Machine Learning

A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing

By NextTech | March 6, 2026 | 5 Mins Read


In this tutorial, we explore how to use Daft as a high-performance, Python-native data engine to build an end-to-end analytical pipeline. We start by loading a real-world MNIST dataset, then progressively transform it using UDFs, feature engineering, aggregations, joins, and lazy execution. We also demonstrate how to seamlessly combine structured data processing, numerical computation, and machine learning. By the end, we are not just manipulating data; we are building a complete, model-ready pipeline powered by Daft's scalable execution engine.

!pip -q install daft pyarrow pandas numpy scikit-learn


import os
os.environ["DO_NOT_TRACK"] = "true"


import numpy as np
import pandas as pd
import daft
from daft import col


print("Daft version:", getattr(daft, "__version__", "unknown"))


URL = "https://github.com/Eventual-Inc/mnist-json/raw/master/mnist_handwritten_test.json.gz"


df = daft.read_json(URL)
print("\nSchema (sampled):")
print(df.schema())


print("\nPeek:")
df.show(5)

We install Daft and its supporting libraries directly in Google Colab to ensure a clean, reproducible environment. We configure optional settings and verify the installed version to confirm everything is working correctly. By doing this, we establish a stable foundation for building our end-to-end data pipeline.

def to_28x28(pixels):
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)


df2 = (
    df
    .with_column(
        "img_28x28",
        col("image").apply(to_28x28, return_dtype=daft.DataType.python())
    )
    .with_column(
        "pixel_mean",
        col("img_28x28").apply(lambda x: float(np.mean(x)) if x is not None else None,
                               return_dtype=daft.DataType.float32())
    )
    .with_column(
        "pixel_std",
        col("img_28x28").apply(lambda x: float(np.std(x)) if x is not None else None,
                               return_dtype=daft.DataType.float32())
    )
)


print("\nAfter reshaping + simple features:")
df2.select("label", "pixel_mean", "pixel_std").show(5)

We load a real-world MNIST JSON dataset directly from a remote URL using Daft's native reader. We inspect the schema and preview the data to understand its structure and column types. This lets us validate the dataset before applying transformations and feature engineering.
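Each record in that MNIST JSON file is expected to be a flat object with an integer label and a 784-element pixel list. A minimal, self-contained sketch (field names `image` and `label` as used above; the pixel values here are synthetic) can confirm that shape assumption before the pipeline relies on it:

```python
import json

# A synthetic record mirroring the MNIST JSON layout assumed by the reader:
# one "image" field with 784 grayscale values and an integer "label".
record = {"image": [0] * 784, "label": 7}

# Round-trip through JSON to confirm the structure survives serialization.
decoded = json.loads(json.dumps(record))
assert len(decoded["image"]) == 784
assert 0 <= decoded["label"] <= 9
print("record OK:", decoded["label"], len(decoded["image"]))
```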

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
    out = []
    for img in images_28x28.to_pylist():
        if img is None:
            out.append(None)
            continue
        img = np.asarray(img, dtype=np.float32)
        row_sums = img.sum(axis=1) / 255.0
        col_sums = img.sum(axis=0) / 255.0
        total = img.sum() + 1e-6
        ys, xs = np.indices(img.shape)
        cy = float((ys * img).sum() / total) / 28.0
        cx = float((xs * img).sum() / total) / 28.0
        vec = np.concatenate([row_sums, col_sums, np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32)])
        out.append(vec.astype(np.float32).tolist())
    return out


df3 = df2.with_column("features", featurize(col("img_28x28")))


print("\nFeature column created (list[float]):")
df3.select("label", "features").show(2)
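The intensity-centroid terms inside `featurize` can be checked in isolation with plain NumPy. On an image with a single lit pixel, the normalized centroid should land almost exactly on that pixel's coordinates:

```python
import numpy as np

# Sanity-check the centroid arithmetic from the UDF on a single bright pixel.
img = np.zeros((28, 28), dtype=np.float32)
img[10, 20] = 255.0  # one lit pixel at row 10, column 20

total = img.sum() + 1e-6
ys, xs = np.indices(img.shape)
cy = float((ys * img).sum() / total) / 28.0
cx = float((xs * img).sum() / total) / 28.0

# The centroid lands (almost exactly) on the lit pixel, scaled to [0, 1).
assert abs(cy - 10 / 28) < 1e-4
assert abs(cx - 20 / 28) < 1e-4
print(round(cy, 4), round(cx, 4))
```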

We reshape the raw pixel arrays into structured 28×28 images using a row-wise UDF. We compute statistical features, such as the mean and standard deviation, to enrich the dataset. By applying these transformations, we convert raw image data into structured, model-friendly representations.
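Stripped of the Daft plumbing, the reshape-and-summarize step reduces to a few NumPy calls. A minimal sketch on one synthetic 784-value row (a stand-in for a real MNIST record) shows exactly what `to_28x28` and the `pixel_mean`/`pixel_std` columns compute:

```python
import numpy as np

# One synthetic flat pixel row standing in for a real MNIST record.
pixels = list(range(784))
arr = np.array(pixels, dtype=np.float32)
assert arr.size == 784  # the same guard used in to_28x28

# Reshape to the 28x28 image, then derive the two scalar features.
img = arr.reshape(28, 28)
pixel_mean = float(img.mean())
pixel_std = float(img.std())
print(img.shape, round(pixel_mean, 2), round(pixel_std, 2))
```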

label_stats = (
    df3.groupby("label")
       .agg(
           col("label").count().alias("n"),
           col("pixel_mean").mean().alias("mean_pixel_mean"),
           col("pixel_std").mean().alias("mean_pixel_std"),
       )
       .sort("label")
)


print("\nLabel distribution + summary stats:")
label_stats.show(10)


df4 = df3.join(label_stats, on="label", how="left")


print("\nJoined label stats back onto each row:")
df4.select("label", "n", "mean_pixel_mean", "mean_pixel_std").show(5)

We implement a batch UDF to extract richer feature vectors from the reshaped images. We perform group-by aggregations and join the summary statistics back onto the dataset for contextual enrichment. This demonstrates how we combine scalable computation with advanced analytics inside Daft.
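The groupby-then-join enrichment pattern used above is not Daft-specific; the same shape can be sketched in pandas on a tiny hypothetical frame (column names mirror the pipeline's, the values are made up), which may help if you want to prototype the aggregation logic before running it at scale:

```python
import pandas as pd

# Toy frame mirroring the pipeline's per-row columns.
df = pd.DataFrame({
    "label": [0, 0, 1, 1, 1],
    "pixel_mean": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Per-label summary stats (count + mean), then merged back onto every row,
# the same enrichment the Daft groupby/agg/join performs.
stats = (
    df.groupby("label")
      .agg(n=("label", "count"), mean_pixel_mean=("pixel_mean", "mean"))
      .reset_index()
)
enriched = df.merge(stats, on="label", how="left")
print(enriched)
```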

small = df4.select("label", "features").collect().to_pandas()


small = small.dropna(subset=["label", "features"]).reset_index(drop=True)


X = np.vstack(small["features"].apply(np.array).values).astype(np.float32)
y = small["label"].astype(int).values


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


clf = LogisticRegression(max_iter=1000, n_jobs=None)
clf.fit(X_train, y_train)


pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)


print("\nBaseline accuracy (feature-engineered LogisticRegression):", round(acc, 4))
print("\nClassification report:")
print(classification_report(y_test, pred, digits=4))


out_df = df4.select("label", "features", "pixel_mean", "pixel_std", "n")
out_path = "/content/daft_mnist_features.parquet"
out_df.write_parquet(out_path)


print("\nWrote parquet to:", out_path)


df_back = daft.read_parquet(out_path)
print("\nRead-back check:")
df_back.show(3)

We materialize the selected columns into pandas and train a baseline Logistic Regression model. We evaluate performance to validate the usefulness of our engineered features. We also persist the processed dataset to Parquet, completing our end-to-end pipeline from raw data ingestion to production-ready storage.
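The baseline-training step can be exercised without downloading MNIST at all. The sketch below substitutes synthetic two-class feature vectors for the real `features` matrix (the class means and dimensions are arbitrary assumptions) and runs the same split/fit/score sequence, which is useful for smoke-testing the modeling code in isolation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic, well-separated 2-class features standing in for the
# MNIST-derived matrix produced by the Daft pipeline.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, (100, 60)),
    rng.normal(2.0, 1.0, (100, 60)),
]).astype(np.float32)
y = np.array([0] * 100 + [1] * 100)

# Same split/fit/score sequence as the baseline above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print("accuracy:", round(acc, 4))
```

Because the two synthetic classes are widely separated, accuracy here should be near 1.0; the real MNIST features will score lower.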

In this tutorial, we built a production-style data workflow using Daft, moving from raw JSON ingestion to feature engineering, aggregation, model training, and Parquet persistence. We demonstrated how to integrate advanced UDF logic, perform efficient groupby and join operations, and materialize results for downstream machine learning, all within a clean, scalable framework. Through this process, we saw how Daft enables us to handle complex transformations while remaining Pythonic and efficient. We finished with a reusable, end-to-end pipeline that showcases how we can combine modern data engineering and machine learning workflows in a unified environment.




Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
