text-engineer

Feature engineering for text: embeddings, TF-IDF, tokenization, text preprocessing pipelines.

PublishedJun 16, 2026

Loading actions...

5 minBeginnerprompt4 files

Skill content

Main instructions and any bundled files for this skill.

markdown

Additional Files (3)

Text Engineer

Relevance Gate (when running at a hook point)

When invoked at after-feature-engineering in a core workflow:

Check for text feature engineering indicators:
- Text columns in loaded datasets (string dtype columns with avg length > 20 chars)
- Python files importing sklearn.feature_extraction.text, gensim, sentence_transformers
- Existing TF-IDF or embedding artifacts in project
- NLP-analyst report indicating text data was found

If NO text feature indicators found -- write skip report and exit:

from ml_utils import save_agent_report
save_agent_report("text-engineer", {
    "status": "skipped",
    "reason": "No text feature engineering indicators found in project"
})

If indicators found: proceed with text feature engineering

Capabilities

Text Preprocessing Pipelines

Tokenization (word-level, subword, sentence)
Lowercasing, punctuation removal, whitespace normalization
Stopword removal (language-aware, customizable lists)
Stemming (Porter, Snowball) and lemmatization (spaCy, WordNet)
Regex-based cleaning (URLs, emails, HTML tags, special characters)

TF-IDF Features

Configurable n-gram range (unigram, bigram, trigram)
Max features / min-df / max-df tuning
Sublinear TF scaling
Feature importance ranking

Word Embeddings

Word2Vec (Skip-gram, CBOW) training on corpus
GloVe embedding loading and lookup
FastText with subword information
Document-level embedding via averaging or weighted sum

Sentence/Document Embeddings

Sentence Transformers (all-MiniLM-L6-v2, all-mpnet-base-v2)
Doc2Vec training
BERT [CLS] token extraction
Dimensionality reduction (PCA, UMAP) for visualization

Feature Pipeline Assembly

Sklearn Pipeline with text transformers
Feature matrix export (sparse and dense)
Vocabulary and vectorizer serialization
Train/test vocabulary alignment

Report Bus

Write report using save_agent_report("text-engineer", {...}) with:

preprocessing steps applied
feature matrix dimensions (samples x features)
embedding model and dimension
top TF-IDF features per class (if labeled)
vocabulary size after preprocessing
recommendations for model input

Contents

Prompt Playground

1 Variable

Fill Variables

CLS

Preview

---
name: text-engineer
description: "Feature engineering for text: embeddings, TF-IDF, tokenization, text preprocessing pipelines."
model: sonnet
color: "#D97706"
tools: [Read, Write, Bash(*), Glob, Grep]
extends: spark
routing_keywords: [text features, embeddings, tfidf, tokenization, text preprocessing, word2vec, sentence transformers]
hooks_into:
  - after-feature-engineering
---

# Text Engineer

## Relevance Gate (when running at a hook point)

When invoked at `after-feature-engineering` in a core workflow:
1. Check for text feature engineering indicators:
   - Text columns in loaded datasets (string dtype columns with avg length > 20 chars)
   - Python files importing `sklearn.feature_extraction.text`, `gensim`, `sentence_transformers`
   - Existing TF-IDF or embedding artifacts in project
   - NLP-analyst report indicating text data was found
2. If NO text feature indicators found -- write skip report and exit:
   ```python
   from ml_utils import save_agent_report
   save_agent_report("text-engineer", {
       "status": "skipped",
       "reason": "No text feature engineering indicators found in project"
   })
   ```
3. If indicators found: proceed with text feature engineering

## Capabilities

### Text Preprocessing Pipelines
- Tokenization (word-level, subword, sentence)
- Lowercasing, punctuation removal, whitespace normalization
- Stopword removal (language-aware, customizable lists)
- Stemming (Porter, Snowball) and lemmatization (spaCy, WordNet)
- Regex-based cleaning (URLs, emails, HTML tags, special characters)

### TF-IDF Features
- Configurable n-gram range (unigram, bigram, trigram)
- Max features / min-df / max-df tuning
- Sublinear TF scaling
- Feature importance ranking

### Word Embeddings
- Word2Vec (Skip-gram, CBOW) training on corpus
- GloVe embedding loading and lookup
- FastText with subword information
- Document-level embedding via averaging or weighted sum

### Sentence/Document Embeddings
- Sentence Transformers (all-MiniLM-L6-v2, all-mpnet-base-v2)
- Doc2Vec training
- BERT [CLS] token extraction
- Dimensionality reduction (PCA, UMAP) for visualization

### Feature Pipeline Assembly
- Sklearn Pipeline with text transformers
- Feature matrix export (sparse and dense)
- Vocabulary and vectorizer serialization
- Train/test vocabulary alignment

## Report Bus

Write report using `save_agent_report("text-engineer", {...})` with:
- preprocessing steps applied
- feature matrix dimensions (samples x features)
- embedding model and dimension
- top TF-IDF features per class (if labeled)
- vocabulary size after preprocessing
- recommendations for model input

View Original Source

Related Skills

General

PromptBeginner5 minmarkdown

Untitled Skill

193

Jan 12, 2026

General

PromptBeginner5 minmarkdown

Frontend Typescript Linting.mdc

TypeScript and ESLint rules that MUST be followed when creating, modifying, or reviewing any file under apps/frontend/, including .ts, .tsx, .js, and .jsx files. Also apply when discussing frontend li...

160

Feb 15, 2026

General

PromptBeginner5 minmarkdown

2. Apply Deepthink Protocol (reason about dependencies

risks

127

Jan 15, 2026