---
title: Tibetan Text Metrics
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
python_version: 3.11
app_file: app.py
---

Tibetan Text Metrics Web App

Python 3.10+ · License: CC BY 4.0 · Project Status: Active – web app version for accessible text analysis.

Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts — no programming required.

Quick Start (3 Steps)

  1. Upload two or more Tibetan text files (.txt format)
  2. Click "Compare My Texts"
  3. View the results — higher scores mean more similarity

That's it! The default settings work well for most cases. See the results section for colorful heatmaps showing which chapters are most similar.

Tip: If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.

What's New (v0.4.0)

  • New preset-based UI: Choose "Quick Start" for simple analysis or "Custom" for full control
  • Three analysis presets: Standard, Deep (with AI), and Quick (fastest)
  • Word-level tokenization is now the default (recommended for Jaccard similarity)
  • Particle normalization: Treat grammatical particle variants as equivalent (གི/ཀྱི/གྱི → གི)
  • LCS normalization options: Choose how to handle texts of different lengths
  • Improved stopword matching: Fixed tsek (་) handling for consistent filtering
  • Tibetan-optimized fuzzy matching: Syllable-level methods only (removed character-level methods)
  • Dharmamitra models: Buddhist-specific semantic similarity models as default
  • Modernized theme: Cleaner UI with better responsive design

Background

The Tibetan Text Metrics project provides quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application makes these capabilities accessible through an intuitive interface — no command-line or Python experience needed.

Key Features of the Web App

  • Easy File Upload: Upload one or more Tibetan .txt files directly through the browser.
  • Automatic Segmentation: Uses the Tibetan section marker ༈ to automatically split texts into comparable chapters or sections.
  • Core Metrics Computed:
    • Jaccard Similarity (%): Measures vocabulary overlap between segments. Word-level tokenization recommended. Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.
    • Normalized Longest Common Subsequence (LCS): Identifies the longest shared sequence of words, indicating direct textual parallels. Supports multiple normalization modes (average, min, max).
    • Fuzzy Similarity: Uses syllable-level fuzzy matching to detect approximate matches, accommodating spelling variations and scribal differences in Tibetan text.
    • Semantic Similarity: Uses Buddhist-specific sentence-transformer embeddings (Dharmamitra) to compare the contextual meaning of segments.
  • Handles Long Texts: Implements automated handling for long segments when computing semantic embeddings.
  • Model Selection: Semantic similarity uses Hugging Face sentence-transformer models. Default is Dharmamitra's buddhist-nlp/buddhist-sentence-similarity, trained specifically for Buddhist texts.
  • Tokenization Modes:
    • Word (default, recommended): Keeps multi-syllable words together for more meaningful comparison
    • Syllable: Splits into individual syllables for finer-grained analysis
  • Stopword Filtering: Three levels of filtering for Tibetan words:
    • None: No filtering, includes all words
    • Standard: Filters only common particles and punctuation
    • Aggressive: Filters all function words including particles, pronouns, and auxiliaries
  • Particle Normalization: Optional normalization of grammatical particles to canonical forms (e.g., གི/ཀྱི/གྱི → གི, ལ/ར/སུ/ཏུ/དུ → ལ). Reduces false negatives from sandhi variation.
  • Interactive Visualizations:
    • Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
    • Bar chart displaying word counts per segment.
    • Vocabulary containment chart showing what percentage of each text's unique vocabulary appears in the other text (directional metric).
  • Advanced Interpretation: Get scholarly insights about your results with a built-in analysis engine that:
    • Examines your metrics and provides contextual interpretation of textual relationships
    • Generates a dual-layer narrative analysis (scholarly and accessible)
    • Identifies patterns across chapters and highlights notable textual relationships
    • Connects findings to Tibetan textual studies concepts (transmission lineages, regional variants)
    • Suggests questions for further investigation
  • Downloadable Results: Export detailed metrics as a CSV file and save heatmaps as PNG files.
  • Simplified Workflow: No command-line interaction or Python scripting needed for analysis.

Advanced Features

Using AI-Powered Analysis

The application includes an "Interpret Results" button that provides scholarly insights about your text similarity metrics. This feature:

  1. Selects models dynamically: automatically discovers available free models from OpenRouter (Qwen, Google Gemma, Meta Llama, Mistral, DeepSeek)
  2. Requires an OpenRouter API key (set via the OPENROUTER_API_KEY environment variable)
  3. Falls back to rule-based analysis if no API key is provided or all models fail
  4. Produces a comprehensive scholarly analysis including:
    • Introduction explaining the texts compared and general observations
    • Overall patterns across all chapters with visualized trends
    • Detailed examination of notable chapters (highest/lowest similarity)
    • Discussion of what different metrics reveal about textual relationships
    • Conclusions suggesting implications for Tibetan textual scholarship
    • Specific questions these findings raise for further investigation
    • Cautionary notes about interpreting perfect matches or zero similarity scores
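When running locally, the key can be supplied via the environment before launching the app (the key value below is a placeholder, not a real key):

```shell
# Provide the OpenRouter API key to the app via the environment.
export OPENROUTER_API_KEY="sk-or-..."   # replace with your own key
python app.py
```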

Data Processing

  • Automatic Filtering: The system automatically filters out perfect matches (1.0 across all metrics) that may result from empty cells or identical text comparisons
  • Robust Analysis: The system handles edge cases and provides meaningful metrics even with imperfect data
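A minimal sketch of the perfect-match filtering rule, assuming a simple list-of-dicts row format (the app's real data structures may differ):

```python
# Drop comparison rows where every metric equals 1.0, which usually
# indicates empty cells or a text compared against itself.
METRICS = ["jaccard", "lcs", "fuzzy", "semantic"]

def filter_perfect_matches(rows):
    """Keep a row unless all of its metrics are exactly 1.0."""
    return [r for r in rows if not all(r[m] == 1.0 for m in METRICS)]

rows = [
    {"pair": "ch1", "jaccard": 1.0, "lcs": 1.0, "fuzzy": 1.0, "semantic": 1.0},
    {"pair": "ch2", "jaccard": 0.6, "lcs": 0.4, "fuzzy": 0.7, "semantic": 0.8},
]
print(filter_perfect_matches(rows))  # only ch2 survives
```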

Text Segmentation and Best Practices

Why segment your texts?

To obtain meaningful results, it is highly recommended to divide your Tibetan texts into logical chapters or sections before uploading. Comparing entire texts as a single unit often produces shallow or misleading results, especially for long or complex works. Chapters or sections allow the tool to detect stylistic, lexical, or structural differences that would otherwise be hidden.

How to segment your texts:

  • Use the Tibetan section marker ༈ (sbrul shad) to separate chapters/sections in your .txt files.
  • Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
  • The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
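The splitting step can be sketched in a few lines of Python (the function name and warning text are illustrative, not the app's actual code):

```python
# Split a Tibetan text into segments on the sbrul shad marker.
MARKER = "༈"

def segment_text(text):
    """Split on the section marker; fall back to one segment with a warning."""
    parts = [p.strip() for p in text.split(MARKER)]
    parts = [p for p in parts if p]  # drop empty segments
    if len(parts) <= 1:
        print("Warning: no section marker found; treating file as one segment.")
    return parts

raw = "ཆོས་ཀྱི་དབང་ཕྱུག ༈ རྒྱལ་པོའི་བཀའ་ཁྲིམས ༈ ཞལ་ལྕེ་བཅུ་དྲུག"
print(len(segment_text(raw)))  # 3
```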

Best practices:

  • Ensure the marker is unique and does not appear within a chapter.
  • Try to keep chapters/sections of similar length for more balanced comparisons.
  • For poetry or short texts, consider grouping several poems or stanzas as one segment.

Implemented Metrics

Stopword Filtering: To enhance the accuracy and relevance of similarity scores, the Jaccard Similarity and Fuzzy Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. Stopwords are normalized to handle tsek (་) variations consistently.
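Tsek-insensitive matching can be illustrated with a small sketch: both the stored stopwords and incoming tokens are normalized before comparison, so གི and གི་ hit the same entry (the sample list here is illustrative; the full list lives in pipeline/stopwords_bo.py):

```python
TSEK = "་"

def normalize_token(token):
    """Strip a trailing tsek so lookups are insensitive to it."""
    return token.rstrip(TSEK)

# Store stopwords in normalized form, then compare normalized tokens.
STOPWORDS = {normalize_token(w) for w in ["གི", "ནི་", "དང་"]}

def remove_stopwords(tokens):
    return [t for t in tokens if normalize_token(t) not in STOPWORDS]

print(remove_stopwords(["ཆོས་", "དང་", "གི་", "དགེ་བ"]))  # ['ཆོས་', 'དགེ་བ']
```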

Particle Normalization: Tibetan grammatical particles change form based on the preceding syllable (sandhi). For example, the genitive particle appears as གི, ཀྱི, གྱི, ཡི, or འི depending on context. When particle normalization is enabled, all variants are treated as equivalent, reducing false negatives when comparing texts with different scribal conventions.
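A sketch of how such a mapping might look (the table is deliberately abbreviated and illustrative; the app's actual mapping lives in pipeline/normalize_bo.py):

```python
# Map sandhi variants of grammatical particles to one canonical form.
PARTICLE_MAP = {
    "ཀྱི": "གི", "གྱི": "གི", "ཡི": "གི", "འི": "གི",   # genitive variants
    "ར": "ལ", "སུ": "ལ", "ཏུ": "ལ", "དུ": "ལ",          # la-don variants
}

def normalize_particles(tokens):
    """Replace particle variants; leave all other tokens unchanged."""
    return [PARTICLE_MAP.get(t, t) for t in tokens]

print(normalize_particles(["བླ་མ", "ཡི", "གསུང"]))  # ['བླ་མ', 'གི', 'གསུང']
```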

The comprehensive list of Tibetan stopwords is adapted and compiled from community resources, including the Divergent Discourses stopword list and BUDA's lucene-bo analyzer. We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.

Feel free to edit this list of stopwords to better suit your needs. The list is stored in the pipeline/stopwords_bo.py file.

The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:

  1. Jaccard Similarity (%): This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, optionally filtering out common Tibetan stopwords. It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?' It is calculated as (Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100. Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
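The formula above can be sketched directly in Python (stopword filtering omitted for brevity; Latin placeholders stand in for Tibetan tokens):

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Percentage of unique shared words among all unique words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union) * 100

# 2 shared words out of 4 unique words overall.
print(jaccard_similarity(["ka", "kha", "ga"], ["ka", "ga", "nga"]))  # 50.0
```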

Stopword Filtering: Three levels of filtering are available:

  • None: No filtering, includes all words in the comparison
  • Standard: Filters only common particles and punctuation
  • Aggressive: Filters all function words including particles, pronouns, and auxiliaries

This helps focus on meaningful content words rather than grammatical elements.

  2. Normalized LCS (Longest Common Subsequence): This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.

    Normalization options:

    • Average (default): Divides LCS length by the average of both text lengths. Balanced comparison.
    • Min: Divides by the shorter text length. Useful for detecting if one text contains the other (e.g., quotes within commentary). Can return 1.0 if the shorter text is fully contained.
    • Max: Divides by the longer text length. Stricter metric that penalizes length differences.

    A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.

    Note on Interpretation: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary.
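The metric and its three normalization modes can be sketched with standard dynamic programming (the app's optional Cython extension computes the same quantity faster):

```python
def lcs_length(a, b):
    """Classic dynamic-programming LCS length over two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalized_lcs(a, b, mode="average"):
    """Normalize LCS length by average, min, or max of the two lengths."""
    if not a or not b:
        return 0.0
    denom = {"average": (len(a) + len(b)) / 2,
             "min": min(len(a), len(b)),
             "max": max(len(a), len(b))}[mode]
    return lcs_length(a, b) / denom

a = ["ka", "kha", "ga", "nga"]
b = ["ka", "ga"]
print(normalized_lcs(a, b, "min"))  # 1.0: the shorter text is fully contained
```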

  3. Fuzzy Similarity: This metric uses syllable-level fuzzy matching algorithms to detect approximate matches, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical.

    Available methods (all work at syllable level):

    • Syllable N-gram Overlap (default, recommended): Compares syllable bigrams between texts. Best for detecting shared phrases and local patterns.
    • Syllable-level Edit Distance: Computes Levenshtein distance at the syllable/token level. Detects minor variations while respecting syllable boundaries.
    • Weighted Jaccard: Like standard Jaccard but considers token frequency, giving more weight to frequently shared terms.

    Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. All methods work at the syllable level, which is linguistically appropriate for Tibetan.
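One plausible formulation of the default syllable-bigram method is a Dice coefficient over bigram sets; the app's exact formula may differ:

```python
def syllable_bigrams(syllables):
    """Set of adjacent syllable pairs."""
    return {tuple(syllables[i:i + 2]) for i in range(len(syllables) - 1)}

def bigram_overlap(syll_a, syll_b):
    """Dice coefficient over syllable-bigram sets (one plausible formulation)."""
    ba, bb = syllable_bigrams(syll_a), syllable_bigrams(syll_b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# Wylie placeholders for syllables; texts share 2 of 3 bigrams each.
a = ["chos", "kyi", "dbang", "phyug"]
b = ["chos", "kyi", "dbang", "po"]
print(round(bigram_overlap(a, b), 2))  # 0.67
```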

Stopword Filtering: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:

  • None: No filtering, includes all words in the comparison
  • Standard: Filters only common particles and punctuation
  • Aggressive: Filters all function words including particles, pronouns, and auxiliaries
  4. Semantic Similarity: Computes the cosine similarity between sentence-transformer embeddings of text segments. Uses Dharmamitra's Buddhist-specific models by default. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.

    Note: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
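The final comparison step reduces to cosine similarity between embedding vectors. A dependency-free sketch of that formula, with toy 3-dimensional vectors standing in for real model embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy vectors; real sentence-transformer embeddings have hundreds of dimensions.
emb_a = [0.2, 0.8, 0.1]
emb_b = [0.25, 0.75, 0.05]
print(round(cosine_similarity(emb_a, emb_b), 3))
```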

Visualization Metrics

  1. Vocabulary Containment: A directional metric showing what percentage of one text's unique vocabulary appears in the other text. Unlike Jaccard (which is symmetric), containment is calculated in both directions:

    • "Text A → Text B" answers: "What % of Text A's unique words also appear in Text B?"
    • Calculated as: (shared vocabulary size) / (source text vocabulary size) × 100

    Interpreting asymmetric containment:

    • If "Base Text → Commentary" is 95% but "Commentary → Base Text" is 60%, the commentary contains almost all of the base text's vocabulary plus additional words
    • This pattern suggests an expansion or commentary relationship
    • Useful for identifying which text is the "base" version (its vocabulary will be highly contained in expanded versions)
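The directional calculation above can be sketched as follows (token lists as input; the variable names are illustrative):

```python
def containment(source_tokens, target_tokens):
    """% of the source's unique vocabulary that also appears in the target."""
    src, tgt = set(source_tokens), set(target_tokens)
    if not src:
        return 0.0
    return len(src & tgt) / len(src) * 100

base = ["ka", "kha", "ga"]
commentary = ["ka", "kha", "ga", "nga", "ca"]
print(containment(base, commentary))  # 100.0: base vocabulary fully contained
print(containment(commentary, base))  # 60.0: commentary adds extra words
```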

Getting Started (Running Locally)

  1. Ensure you have Python 3.10 or newer.

  2. Navigate to the webapp directory:

    cd path/to/tibetan-text-metrics/webapp
    
  3. Create a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On macOS/Linux
    # .venv\Scripts\activate    # On Windows
    
  4. Install dependencies:

    pip install -r requirements.txt
    
  5. Compile Cython Extension (Recommended for Performance): To speed up the Longest Common Subsequence (LCS) calculation, a Cython extension is provided. To compile it:

    # Ensure you are in the webapp directory
    python setup.py build_ext --inplace
    

    This step requires a C compiler. If you skip this, the application will use a slower, pure Python implementation for LCS.

  6. Run the Web Application:

    python app.py
    
  7. Open your web browser and go to the local URL provided (usually http://127.0.0.1:7860).

Usage

Quick Start (Recommended for Most Users)

  1. Upload Files: Select one or more .txt files containing Tibetan Unicode text.
  2. Choose a Preset: In the "Quick Start" tab, select an analysis type:
| Preset | What it does | Best for |
| --- | --- | --- |
| Standard | Vocabulary + Sequences + Fuzzy matching | Most comparisons |
| Deep | All metrics including AI meaning analysis | Finding semantic parallels |
| Quick | Vocabulary overlap only | Fast initial scan |
  3. Click "Compare My Texts": Results appear below with heatmaps and downloadable CSV.

Custom Analysis (Advanced Users)

For fine-grained control, use the "Custom" tab:

  • Lexical Metrics: Configure tokenization (word/syllable), stopword filtering, and particle normalization
  • Sequence Matching (LCS): Enable/disable and choose normalization mode (avg/min/max)
  • Fuzzy Matching: Choose method (N-gram, Syllable Edit, or Weighted Jaccard)
  • Semantic Analysis: Enable AI-based meaning comparison with model selection

Viewing Results

  • Metrics Preview: Summary table of similarity scores
  • Heatmaps: Visual comparison across all chapter pairs (darker = more similar)
  • Word Counts: Bar chart showing segment lengths
  • Vocabulary Containment: Directional metric showing what % of one text's vocabulary is in another
  • CSV Download: Full results for further analysis

AI Interpretation (Optional)

After running analysis, click "Help Interpret Results" for scholarly insights:

  • Pattern identification across chapters
  • Notable textual relationships
  • Suggestions for further investigation

Embedding Model

Semantic similarity uses Hugging Face sentence-transformer models. The following models are available:

  • buddhist-nlp/buddhist-sentence-similarity (default, recommended): Developed by Dharmamitra, this model is specifically trained for sentence similarity on Buddhist texts in Tibetan, Buddhist Chinese, Sanskrit (IAST), and Pāli. Best choice for Tibetan Buddhist manuscripts.
  • buddhist-nlp/bod-eng-similarity: Also from Dharmamitra, optimized for Tibetan-English bitext alignment tasks.
  • sentence-transformers/LaBSE: General multilingual model, good baseline for non-Buddhist texts.
  • BAAI/bge-m3: Strong multilingual alternative with broad language coverage.

These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.

Structure

  • app.py — Gradio web app entry point and UI definition.
  • pipeline/ — Modules for file handling, text processing, metrics calculation, and visualization.
    • process.py: Core logic for segmenting texts and orchestrating metric computation.
    • metrics.py: Implementation of Jaccard, LCS, Fuzzy, and Semantic Similarity.
    • hf_embedding.py: Handles loading and using sentence-transformer models.
    • tokenize.py: Tibetan text tokenization using botok.
    • normalize_bo.py: Tibetan particle normalization for grammatical variants.
    • stopwords_bo.py: Comprehensive Tibetan stopword list with tsek normalization.
    • visualize.py: Generates heatmaps and word count plots.
  • requirements.txt — Python dependencies for the web application.

License

This project is licensed under the Creative Commons Attribution 4.0 International License - see the LICENSE file in the main project directory for details.

Research and Acknowledgements

We acknowledge the broader Tibetan NLP community for tokenization and stopword resources leveraged in this project, including the Divergent Discourses stopword list and BUDA's lucene-bo analyzer.

Citation

If you use this web application or the underlying TTM tool in your research, please cite the main project:

```bibtex
@software{wojahn2025ttm,
  title = {TibetanTextMetrics (TTM): Computing Text Similarity Metrics on POS-tagged Tibetan Texts},
  author = {Daniel Wojahn},
  year = {2025},
  url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
  version = {0.4.0}
}
```

For questions or issues specifically regarding the web application, please refer to the main project's issue tracker or contact Daniel Wojahn.