maslionok committed
Commit bb730a2 · 1 Parent(s): f475522
Files changed (1)
  1. app.py +35 -29

app.py CHANGED
@@ -45,42 +45,48 @@ with gr.Blocks(title="Solr Normalization Demo") as demo:
 
         This demo showcases the **Solr Normalization Pipeline**, which replicates the text preprocessing steps applied by Solr <span title=\"Solr is the platform that provides search capabilities in Impresso. Several preprocessing steps must be undertaken to prepare data to be searchable in Solr. These steps are common in Natural Language Processing pipelines, as they help with normalising textual data by, for example, making the whole text lowercase. This makes possible non case-sensitive searches, where if you either write 'Dog' or 'dog', you can get the same results.\">ℹ️</span> during indexing to help you understand how raw input is transformed before becoming searchable.
 
-        These steps are crucial for improving **search recall** and maintaining **linguistic consistency** across large, multilingual corpora.
-
         You can try the example below, or enter your own text to explore how it is normalized behind the scenes.
         """
     )
 
     with gr.Row():
         with gr.Column():
-            with gr.Accordion("What is Solr?", open=False) as solr_info:
-                gr.Markdown("""
-                **Solr is the search engine platform used to power fast and flexible information retrieval.**
-                It indexes large collections of text and allows users to query them efficiently, returning the most relevant results.
-
-                Before data can be used in Solr, it must go through several **preprocessing and indexing steps**.
-                These include tokenization (splitting text into words), lowercasing, stopword removal (e.g., ignoring common words like "the" or "and"), and stemming or lemmatization (reducing words to their root forms).
-
-                Such steps are common in **Natural Language Processing (NLP)** pipelines, as they help standardize text and make search more robust.
-                For example, thanks to normalization, a search for "running" can also match documents containing "run."
-                Similarly, lowercasing ensures that "History" and "history" are treated as the same word, making searches case-insensitive.
-                """)
-                gr.Markdown("""
-                🧠 **Why is this useful?**
-
-                - It explains why search results might not exactly match the words you entered.
-                - It shows how different word forms are **collapsed** into searchable stems.
-                - It helps interpret unexpected matches (or mismatches) when querying historical text collections.
-                """)
+            with gr.Accordion("What is Solr?", open=False) as solr_info:
+                pass
         with gr.Column():
-            with gr.Accordion("🧹 The pipeline applies", open=False):
-                gr.Markdown("""
-                The pipeline applies:
-                - **Tokenization** (splitting text into searchable units)
-                - **Stopword removal** (filtering out common, uninformative words)
-                - **Lowercasing and normalization**
-                - **Language-specific filters** (e.g., stemming, elision)
-                """)
+            with gr.Accordion("🧹 What pipeline applies?", open=False) as pipeline_info:
+                pass
+
+    # Place Markdown blocks as children of the accordions (not indented inside them)
+    with solr_info:
+        gr.Markdown("""
+        **Solr is the search engine platform used to power fast and flexible information retrieval.**
+        It indexes large collections of text and allows users to query them efficiently, returning the most relevant results.
+
+        Before data can be used in Solr, it must go through several **preprocessing and indexing steps**.
+        These include tokenization (splitting text into words), lowercasing, stopword removal (e.g., ignoring common words like "the" or "and"), and stemming or lemmatization (reducing words to their root forms).
+
+        Such steps are common in **Natural Language Processing (NLP)** pipelines, as they help standardize text and make search more robust.
+        For example, thanks to normalization, a search for "running" can also match documents containing "run."
+        Similarly, lowercasing ensures that "History" and "history" are treated as the same word, making searches case-insensitive.
+        """)
+        gr.Markdown("""
+        🧠 **Why is this useful?**
+
+        - It explains why search results might not exactly match the words you entered.
+        - It shows how different word forms are **collapsed** into searchable stems.
+        - It helps interpret unexpected matches (or mismatches) when querying historical text collections.
+        """)
+    with pipeline_info:
+        gr.Markdown("""
+        The pipeline applies:
+        - **Tokenization** (splitting text into searchable units)
+        - **Stopword removal** (filtering out common, uninformative words)
+        - **Lowercasing and normalization**
+        - **Language-specific filters** (e.g., stemming, elision)
+
+        These steps are crucial for improving **search recall** and maintaining **linguistic consistency** across large, multilingual corpora.
+        """)
 
 
     with gr.Row():
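The four steps the accordion text describes (tokenization, lowercasing, stopword removal, stemming) can be sketched in plain Python. This is a minimal illustration only, not Solr's actual analyzer chain: the `STOPWORDS` set and the suffix-stripping `stem` function below are simplified stand-ins for a real stopword filter and stemmer.

```python
import re

# Illustrative stopword list -- an assumption, not Solr's shipped list.
STOPWORDS = {"the", "and", "a", "an", "of", "to", "in"}


def stem(token: str) -> str:
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    # 1. Tokenization: split text into word-like units.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # 2. Lowercasing: makes matching case-insensitive ("History" == "history").
    tokens = [t.lower() for t in tokens]
    # 3. Stopword removal: drop common, uninformative words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stemming: collapse word forms ("running" -> "run").
    return [stem(t) for t in tokens]


print(normalize("The History of running dogs"))  # → ['history', 'run', 'dog']
```

Because both queries and documents pass through the same chain at index and search time, a query for "Dog" and one for "dog" yield the same normalized terms, which is exactly the case-insensitive behavior the demo's tooltip describes.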