How to Build Smarter Multilingual Text Wrapping with BudouX Through Parsing, HTML Rendering, Model Introspection, and Toy Training


In this tutorial, we explore how BudouX brings clever, phrase-aware line breaking to languages where whitespace is not naturally present, such as Japanese, Chinese, and Thai. We begin by installing the library and working with its default parsers to understand how raw text is segmented into meaningful chunks. We then move into HTML transformation, where we visually see how BudouX improves readability in constrained layouts by inserting invisible breakpoints. As we progress, we dive deeper into the underlying model, inspecting its learned features and weights to understand how decisions are made. We also experiment with custom model manipulation, integrate BudouX into practical workflows like line wrapping and JSON-based pipelines, and evaluate its performance. Finally, we build a minimal end-to-end training pipeline to gain intuition about how such lightweight ML models are built.

import subprocess, sys
def pip(*pkgs):
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
pip("budoux")


import json, time, textwrap, html, random, re, os, tempfile
from pathlib import Path
import budoux
from IPython.display import HTML, display, Markdown


print(f"✅ BudouX version: {budoux.__version__ if hasattr(budoux,'__version__') else 'installed'}")


def header(title):
   display(Markdown(f"## {title}"))


header("1⃣ Default parsers — Japanese / Chinese (Simplified & Traditional) / Thai")


samples = {
   "Japanese (ja)":           ("今日は天気です。BudouXは機械学習を用いた改行整形ツールです。",
                               budoux.load_default_japanese_parser()),
   "Simplified Chinese":      ("今天是晴天。BudouX 是一个使用机器学习的换行整理工具。",
                               budoux.load_default_simplified_chinese_parser()),
   "Traditional Chinese":     ("今天是晴天。BudouX 是一個使用機器學習的換行整理工具。",
                               budoux.load_default_traditional_chinese_parser()),
   "Thai (th)":               ("วันนี้อากาศดีมากและฉันอยากออกไปเดินเล่นที่สวนสาธารณะ",
                               budoux.load_default_thai_parser()),
}
for title, (text, parser) in samples.items():
    chunks = parser.parse(text)
    print(f"\n• {title}")
    print(f"  raw    : {text}")
    print(f"  parsed : {' | '.join(chunks)}    ({len(chunks)} phrases)")

We install BudouX and set up all required imports to start working with the library. We load default parsers for several languages and pass sample sentences through them to observe how the text is segmented into meaningful phrases. This helps us understand the core functionality of BudouX and how it handles different linguistic structures out of the box.
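As a minimal sketch of what we can do with the chunks that `parser.parse()` returns, we can rejoin them with U+200B (zero-width space) so that a renderer may only wrap at phrase boundaries. The chunk list below is hard-coded for illustration rather than real BudouX output:

```python
# Join phrase chunks with U+200B (zero-width space): the visible text is
# unchanged, but a browser gains legal wrap points at each phrase boundary.
ZWSP = "\u200b"

def join_phrases(chunks):
    """Concatenate phrases with zero-width spaces between them."""
    return ZWSP.join(chunks)

# Hard-coded stand-in for parser.parse("今日は天気です。") output.
chunks = ["今日は", "天気です。"]
joined = join_phrases(chunks)
assert joined.replace(ZWSP, "") == "今日は天気です。"  # visible text unchanged
assert joined.count(ZWSP) == len(chunks) - 1          # one break point between chunks
```

This is the same trick `translate_html_string` applies for us in the next section.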

header("2⃣ HTML translation with `translate_html_string`")


ja_parser = budoux.load_default_japanese_parser()
html_in = "今日は<b>とても天気</b>です。"
html_out = ja_parser.translate_html_string(html_in)
visible = html_out.replace("\u200b", "·")
print("Input  HTML :", html_in)
print("Output HTML :", html_out)
print("Visualised  :", visible)


demo_text = ("BudouXは機械学習を用いて、CJK言語の文章を意味のある"
            "フレーズに分割し、自然な位置で改行できるようにします。")
demo_html = ja_parser.translate_html_string(demo_text)
display(HTML(f"""
<div style="display:flex; gap:16px; font-family:'Hiragino Sans',sans-serif;">
  <div style="width:140px; border:2px solid #c33; padding:8px;">
    <b style="color:#c33;">❌ Plain</b><br>{demo_text}
  </div>
  <div style="width:140px; border:2px solid #2a8; padding:8px;">
    <b style="color:#2a8;">✅ BudouX</b><br>{demo_html}
  </div>
</div>
"""))


header("3⃣ Model introspection — options & weights")


model_dir = Path(budoux.__file__).parent / "models"
print("Bundled models:", [p.name for p in model_dir.glob("*.json")])


with open(model_dir / "ja.json", encoding="utf-8") as f:
   ja_model = json.load(f)


print(f"\nFeature categories in ja.json: {list(ja_model.keys())}")
total = sum(len(v) for v in ja_model.values())
print(f"Total learned features: {total:,}")
for cat, feats in ja_model.items():
    print(f"  • {cat:5s}  → {len(feats):,} features")


flat = [(cat, feat, w) for cat, d in ja_model.items() for feat, w in d.items()]
flat.sort(key=lambda x: x[2], reverse=True)
print("\nTop 5 features that vote 'BREAK HERE':")
for cat, feat, w in flat[:5]:
    print(f"  [{cat}] {feat!r}  → weight={w}")
print("\nTop 5 features that vote 'DO NOT BREAK':")
for cat, feat, w in flat[-5:]:
    print(f"  [{cat}] {feat!r}  → weight={w}")

We use BudouX to transform HTML strings by inserting invisible breakpoints that improve text wrapping. We visualize the effect by comparing plain text rendering with BudouX-enhanced output in a constrained layout. We also inspect the internal model structure, exploring feature categories and weights to understand how the segmentation decisions are learned.
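To make the weight tables concrete, here is a toy feature-vote scorer in the spirit of the JSON we just inspected: every fired (category, feature) pair adds its weight, and a positive total votes for a break. The tiny model and the category names below are invented for illustration; this is not BudouX's exact inference code.

```python
# Toy linear scorer: weights for features observed around a candidate break.
# Categories and weights are made up for this sketch.
toy_model = {
    "UW3": {"は": 3000, "の": -1200},   # character just before the candidate break
    "BW2": {"天気": -2500},             # bigram straddling the candidate break
}

def score(fired):
    """Sum the weights of the fired (category, feature) pairs."""
    return sum(toy_model.get(cat, {}).get(feat, 0) for cat, feat in fired)

s = score([("UW3", "は"), ("BW2", "天気")])
print(s, "→ BREAK" if s > 0 else "→ keep together")  # 500 → BREAK
```

Here the positive "break after は" vote narrowly outweighs the negative "don't split 天気" vote, which mirrors how the top-5 positive and negative features printed above compete at each position.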

header("4⃣ Loading a custom model with `budoux.Parser(model)`")


neutered = {cat: {k: 0 for k in d} for cat, d in ja_model.items()}
flat_parser = budoux.Parser(neutered)
print("All-zero model output :", flat_parser.parse("今日は天気です。"))
print("Default model output  :", ja_parser.parse("今日は天気です。"))


header("5⃣ Practical: custom separators, line-wrapping, JSON export")


def wrap_with_budoux(text, parser, max_width=12, sep="\n"):
    lines, current = [], ""
    for phrase in parser.parse(text):
        if len(current) + len(phrase) > max_width and current:
            lines.append(current); current = phrase
        else:
            current += phrase
    if current: lines.append(current)
    return sep.join(lines)


novel = ("吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
        "何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。")
print("Wrapped at width 12:")
print(wrap_with_budoux(novel, ja_parser, max_width=12))


seg = {"text": novel, "phrases": ja_parser.parse(novel)}
print("\nJSON payload (first 120 chars):", json.dumps(seg, ensure_ascii=False)[:120], "...")

We experiment with a custom model by setting all feature weights to zero and observing how the segmentation behavior changes. We then implement a practical text-wrapping function that respects BudouX phrase boundaries for better readability. Finally, we export the segmented output as JSON, making it easy to integrate into downstream systems or front-end applications.
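One further downstream variant (an assumption on our part, not a built-in BudouX option) is to swap the U+200B characters that `translate_html_string` inserts for explicit `<wbr>` tags, which are visible in dev tools and therefore easier to debug. The input below is a hard-coded sample rather than live parser output:

```python
# Replace invisible zero-width spaces with explicit <wbr> (word-break
# opportunity) tags. Both tell the browser "you may wrap here".
budoux_html = "今日は\u200bとても天気です。"
wbr_html = budoux_html.replace("\u200b", "<wbr>")
print(wbr_html)  # 今日は<wbr>とても天気です。
```

Since `<wbr>` is a void element, this substitution keeps the HTML well-formed while making every break opportunity inspectable.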

header("6⃣ Performance benchmark")


big_text = novel * 200
t0 = time.perf_counter()
phrases = ja_parser.parse(big_text)
elapsed = time.perf_counter() - t0
print(f"Parsed {len(big_text):,} chars → {len(phrases):,} phrases "
      f"in {elapsed*1000:.1f} ms  ({len(big_text)/elapsed/1000:.0f}k chars/sec)")


header("7⃣ Mini end-to-end trainer (toy demo)")


training_lines = [
   "私は▁遅刻魔で、▁待ち合わせに▁いつも▁遅刻して▁しまいます。",
   "メールで▁待ち合わせ▁相手に▁一言、▁「ごめんね」と▁謝れば▁どうにか▁なると▁思って▁いました。",
   "海外では▁ケータイを▁持って▁いない。",
   "今日は▁とても▁いい▁天気です。",
   "明日は▁雨が▁降る▁かも▁しれません。",
   "週末は▁友達と▁映画を▁見に▁行きます。",
] * 20


SEP = "\u2581"


def extract_features(s, i):
   def g(idx): return s[idx] if 0 <= idx < len(s) else ""
   feats = []
   for off in (-3,-2,-1,0,1,2):
       feats.append(f"U{off}:{g(i+off)}")
   for off in (-2,-1,0,1):
       feats.append(f"B{off}:{g(i+off)}{g(i+off+1)}")
   for off in (-1,0):
       feats.append(f"T{off}:{g(i+off)}{g(i+off+1)}{g(i+off+2)}")
   return feats


def make_examples(lines):
    X, y = [], []
    for line in lines:
        clean = line.replace(SEP, "")
        breaks = set()
        j = 0
        for ch in line:
            if ch == SEP: breaks.add(j)
            else: j += 1
        for i in range(1, len(clean)):
            X.append(extract_features(clean, i))
            y.append(1 if i in breaks else -1)
    return X, y


X, y = make_examples(training_lines)
print(f"Training examples: {len(X)}  (positives: {sum(1 for v in y if v==1)})")

We benchmark BudouX's performance to evaluate its efficiency on large amounts of text. We then begin constructing a minimal training pipeline by preparing labeled data and extracting features around potential breakpoints. This gives us insight into how training data is structured and how features contribute to segmentation decisions.
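To make the U/B/T feature templates concrete, we can run a self-contained copy of the extractor defined above on a short string and check which window features fire around one candidate boundary:

```python
# Self-contained copy of the tutorial's extractor, for illustration.
def extract_features(s, i):
    def g(idx): return s[idx] if 0 <= idx < len(s) else ""
    feats = []
    for off in (-3, -2, -1, 0, 1, 2):     # unigrams around the boundary
        feats.append(f"U{off}:{g(i+off)}")
    for off in (-2, -1, 0, 1):            # bigrams straddling the boundary
        feats.append(f"B{off}:{g(i+off)}{g(i+off+1)}")
    for off in (-1, 0):                   # trigrams straddling the boundary
        feats.append(f"T{off}:{g(i+off)}{g(i+off+1)}{g(i+off+2)}")
    return feats

feats = extract_features("天気です", 2)  # candidate break between 気 and で
assert "U-1:気" in feats and "U0:で" in feats     # characters on each side
assert "B-1:気で" in feats                        # bigram across the break
assert "T-1:気です" in feats                      # trigram across the break
```

Out-of-range offsets simply contribute empty-string features, so positions near the edges of a sentence still produce a fixed-length feature list.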

def adaboost(X, y, rounds=80):
    n = len(y)
    w = [1/n]*n
    feat_set = sorted({f for fx in X for f in fx})
    fmap = [set(fx) for fx in X]
    model_rounds = []
    for r in range(rounds):
        best_feat, best_err, best_pol = None, 1.0, 1
        for f in feat_set:
            err_pos = sum(w[i] for i in range(n) if (f in fmap[i]) != (y[i]==1))
            err_neg = 1 - err_pos
            if err_pos < best_err: best_feat, best_err, best_pol = f, err_pos, +1
            if err_neg < best_err: best_feat, best_err, best_pol = f, err_neg, -1
        if best_err >= 0.5 - 1e-9: break  # no stump better than chance
        eps = max(best_err, 1e-6)
        alpha = 0.5 * ((1-eps)/eps) ** 0.5
        new_w = []
        for i in range(n):
            pred = best_pol if best_feat in fmap[i] else -best_pol
            new_w.append(w[i] * (0.5 if pred == y[i] else 2.0))  # simplified reweighting
        s = sum(new_w); w = [x/s for x in new_w]
        model_rounds.append((best_feat, best_pol, alpha))
    return model_rounds


print("Training (this is a toy trainer — be patient, ~10s)...")
t0 = time.perf_counter()
rounds = adaboost(X, y, rounds=60)
print(f"Done in {time.perf_counter()-t0:.1f}s, {len(rounds)} stumps kept.")


correct = 0
for fx, label in zip(X, y):
    score = sum(a if (f in fx) == (p==1) else -a for f,p,a in rounds)
    pred = 1 if score > 0 else -1
    correct += (pred == label)
print(f"Training accuracy of toy model: {correct/len(X)*100:.1f}%")
print("👉 For a production model, use `scripts/train.py` from the BudouX repo with the matching feature extractor — this section is illustrative.")


header("8⃣ Real-world demo — narrow column comparison")


paragraph = ("BudouXはGoogleが開発したオープンソースの改行ライブラリです。"
            "機械学習モデルを使って、文章を意味のあるフレーズに分割し、"
            "読みやすい位置でのみ改行が起こるようにします。"
            "依存関係がなく軽量なため、ウェブサイトやモバイルアプリに"
            "簡単に組み込むことができます。")
display(HTML(f"""
<div style="display:flex; gap:24px; font-family:'Hiragino Sans','Yu Gothic',sans-serif; font-size:15px;">
  <div style="flex:1; border:2px solid #c33; padding:12px; max-width:180px;">
    <b style="color:#c33;">Without BudouX</b>
    <p style="line-height:1.7;">{paragraph}</p>
  </div>
  <div style="flex:1; border:2px solid #2a8; padding:12px; max-width:180px;">
    <b style="color:#2a8;">With BudouX</b>
    <p style="line-height:1.7;">{ja_parser.translate_html_string(paragraph)}</p>
  </div>
</div>
<p style="font-size:12px;color:#666;">Resize the browser/Colab pane to see the difference more clearly — BudouX never breaks a phrase mid-word.</p>
"""))


print("\n🌸 Tutorial complete. Try plugging BudouX output into your own UI.")

We implement a simple AdaBoost-based training loop to build a toy segmentation model from scratch. We evaluate the model's accuracy to understand how well it learns phrase boundaries from the data. Finally, we present a real-world comparison that shows how BudouX improves readability in narrow layouts, reinforcing its practical value.

In conclusion, we developed a comprehensive understanding of how BudouX applies machine learning to solve the nuanced problem of natural line breaking in CJK and similar languages. We saw how it operates efficiently without heavy dependencies, making it ideal for web and mobile integrations. Through hands-on exploration, from parsing and HTML rendering to model introspection, customization, and even training, we learned not only how to use BudouX but also how to extend and adapt it for our own use cases. This equips us with both the practical tools and the conceptual clarity needed to incorporate phrase-aware text segmentation into real-world applications with confidence.




The post How to Build Smarter Multilingual Text Wrapping with BudouX Through Parsing, HTML Rendering, Model Introspection, and Toy Training appeared first on MarkTechPost.
