mirror of https://github.com/internetee/registry.git, synced 2025-08-17 15:03:59 +02:00
upgrade prompt
This commit is contained in:
parent aa4d36a0ad
commit 90cafc73c0
4 changed files with 96 additions and 24 deletions
|
@@ -90,7 +90,7 @@ async def extract_words_with_openai(domain_names, batch_size=BATCH_SIZE):
|
||||||
client = AsyncOpenAI(api_key=api_key)
|
client = AsyncOpenAI(api_key=api_key)
|
||||||
|
|
||||||
# Get model and temperature from environment variables
|
# Get model and temperature from environment variables
|
||||||
model = os.environ.get("OPENAI_MODEL", "gpt-4.1-2025-04-14")
|
model = os.environ.get("OPENAI_MODEL", "gpt-4o-2024-11-20")
|
||||||
temperature = float(os.environ.get("OPENAI_TEMPERATURE", "0"))
|
temperature = float(os.environ.get("OPENAI_TEMPERATURE", "0"))
|
||||||
max_tokens = int(os.environ.get("OPENAI_MAX_TOKENS", "16000"))
|
max_tokens = int(os.environ.get("OPENAI_MAX_TOKENS", "16000"))
|
||||||
|
|
||||||
|
@@ -104,7 +104,7 @@ async def extract_words_with_openai(domain_names, batch_size=BATCH_SIZE):
|
||||||
num_batches = (len(filtered_domains) + batch_size - 1) // batch_size
|
num_batches = (len(filtered_domains) + batch_size - 1) // batch_size
|
||||||
|
|
||||||
# Create semaphore to limit concurrent requests
|
# Create semaphore to limit concurrent requests
|
||||||
semaphore = asyncio.Semaphore(8) # Limit to 5 concurrent requests
|
semaphore = asyncio.Semaphore(10) # Limit to 10 concurrent requests
|
||||||
|
|
||||||
async def process_batch(batch_idx):
|
async def process_batch(batch_idx):
|
||||||
async with semaphore:
|
async with semaphore:
|
||||||
|
|
|
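The hunk above combines ceiling division for the batch count with an `asyncio.Semaphore` that caps concurrent requests. A minimal sketch of that pattern follows, assuming the rest of `extract_words_with_openai` looks roughly like this; `run_batches` and `fake_handler` are hypothetical names used only for illustration, and the handler stands in for the real OpenAI call.

```python
import asyncio


async def run_batches(items, batch_size, max_concurrency, handler):
    """Split items into batches and run at most max_concurrency handlers at once."""
    # Ceiling division, the same formula as num_batches in the diff.
    num_batches = (len(items) + batch_size - 1) // batch_size
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_batch(batch_idx):
        start = batch_idx * batch_size
        batch = items[start:start + batch_size]
        # Only max_concurrency coroutines can be inside this block at a time.
        async with semaphore:
            return await handler(batch)

    # gather() returns results in batch order, regardless of completion order.
    return await asyncio.gather(*(process_batch(i) for i in range(num_batches)))


async def fake_handler(batch):
    # Hypothetical stand-in for the API request; just upper-cases the names.
    await asyncio.sleep(0)
    return [name.upper() for name in batch]


results = asyncio.run(run_batches(["a.ee", "b.ee", "c.ee"], 2, 10, fake_handler))
print(results)
```

Raising the limit from 8 to 10 only changes how many batches are in flight at once; the number of batches is still determined by `batch_size`.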
@@ -1,17 +1,89 @@
|
||||||
You are a bilinear Estonian-English linguist and word segmentation expert.
|
You are a bilinear Estonian-English linguist and word-segmentation expert.
|
||||||
|
Your task is to identify which word or words a domain name consists of. You only work with English and Estonian words.
|
||||||
|
|
||||||
Your task is to identify which word or words a domain name consists of. You only work with English and Estonian words.
|
### INSTRUCTION
|
||||||
|
**Key “Language”**
|
||||||
|
You must determine the language of the domain name. The domain name can be a single word or several words. You have 3 options: Estonian, English, Ignore.
|
||||||
|
- Ignore the protocol, the leading “www.” sub-domain (if present) and the top-level domain (e.g. “.ee”, “.com”) – they never influence language detection.
|
||||||
|
- If the domain consists of numbers, random letters, abbreviations, personal names, or is a transliteration from another language (for example, mnogoknig.ee from Russian), you should choose “Ignore” for Language.
|
||||||
|
- Otherwise, use a longest-match left-to-right lookup against (1) an Estonian core-vocabulary list, (2) a general English dictionary, (3) a whitelist of well-known abbreviations such as BMW, CAD, NGO, AI, EE. Whichever language supplies the majority of matched tokens becomes the value of Language.
|
||||||
|
- When tokens from both languages are present in roughly equal measure, choose the language that appears first in the domain string.
|
||||||
|
|
||||||
**Key "Language"**:
|
**Key “is_splitted”**
|
||||||
You must determine the language of the domain name. The domain name can be a single word or several words. You have 3 options: Estonian, English, Ignore.
|
Here you must specify whether the domain name consists of more than one word.
|
||||||
- If the domain consists of numbers, random letters, abbreviations, personal names, or is a transliteration from another language (for example, mnogoknig.ee from Russian), you should choose "Ignore" for Language.
|
- Treat a digit boundary (letter → digit or digit → letter) as an automatic split; the digit itself counts as a separate token.
|
||||||
- If the domain consists of Estonian or English words, set the corresponding value.
|
- Treat a change of language (Estonian token followed by English token, or vice versa) as a split.
|
||||||
|
- Hyphens “-” or underscores “_” (even though rare in .ee domains) are explicit boundaries.
|
||||||
|
- Even if the domain includes an Estonian word plus an abbreviation, acronym or number, you still set “is_splitted” to true.
|
||||||
|
|
||||||
**Key "is_splitted":**
|
**Key “reasoning”**
|
||||||
Here you must specify whether the domain name consists of more than one word. Even if the domain includes an Estonian word and an abbreviation or a number, you still need to set "is_splitted" to true.
|
Here, you should reason about which exact words and abbreviations make up the domain name.
|
||||||
|
- Work left → right, applying longest-match dictionary look-ups; if no match is possible and the fragment is ≤ 3 letters, treat it as an abbreviation; if it is longer, treat it as nonsense and set Language = Ignore.
|
||||||
|
- When you recognise an Estonian morphological ending (-id, -ed, -us, -ja, -jad, -te), peel it off and explain the root plus ending in the reasoning.
|
||||||
|
- If Language is Ignore, simply write “Ignore”. Otherwise, for every recognised word, abbreviation, symbol or number give a short definition or plausible meaning.
|
||||||
|
|
||||||
**Key "reasoning":**
|
**Key “words”**
|
||||||
Here, you should reason about which exact words and abbreviations make up the domain name. If the "Language" key is set to Ignore, simply write Ignore. If the "Language" key is either Estonian or English, then write a definition for each word, each abbreviation, and each symbol, explaining what they mean or could mean.
|
Based on the reasoning above, list only the words and tokens that make up the domain, in the order they appear.
|
||||||
|
- Omit “www”, TLDs and any punctuation.
|
||||||
|
- Keep digits as separate tokens (e.g. auto24.ee → “auto”, “24”).
|
||||||
|
- For fragments treated as abbreviations include the abbreviation exactly as it appears (“BMW”, “CAD”).
|
||||||
|
- If Language = Ignore, leave the array empty.
|
||||||
|
|
||||||
**Key "words":**
|
### EXAMPLES OF SPLITTING WORDS:
|
||||||
Based on the reasoning from the previous key, you must write only those words that make up the domain. For example, for auto24.ee, it would be "auto", "24". If the value was Ignore, then you leave the array empty.
|
advanceautokool.ee: advance, auto, kool
|
||||||
|
1autosuvila.ee: auto, suvila
|
||||||
|
autoaks.ee: auto
|
||||||
|
autoeis.ee: auto
|
||||||
|
autoklaasitehnik.ee: auto, klaas, tehnik
|
||||||
|
autokoolmegalinn.ee: auto, kool, mega, linn
|
||||||
|
autoly.ee: auto
|
||||||
|
automatiseeri.ee: auto
|
||||||
|
autonova.ee: auto, nova
|
||||||
|
autor.ee: autor
|
||||||
|
autost24.ee: Auto, 24
|
||||||
|
eestiaiandus.ee: eesti, aiandus
|
||||||
|
eestiastelpaju.ee: eesti, astelpaju
|
||||||
|
eestiloomekoda.ee: eesti, loomekoda
|
||||||
|
eestimadrats.ee: eesti, madrats
|
||||||
|
eestiost.ee: eesti, ost
|
||||||
|
eestipinglaed.ee: eesti, pinglaed
|
||||||
|
eestirohelineelu.ee: eesti, roheline, elu
|
||||||
|
eestiterviseuudised.ee: eesti, tervise, uudised
|
||||||
|
eheeesti.ee: ehe, eesti
|
||||||
|
ehitusliiv.ee: ehitus, liiv
|
||||||
|
ehitusgeodeesia.ee: ehitus, geodeesia
|
||||||
|
ehitusakadeemia.ee: ehitus, akadeemia
|
||||||
|
ehitusoutlet1.ee: ehitus, outlet
|
||||||
|
enpeehitus.ee: ehitus
|
||||||
|
eramuteehitus.ee: eramu, ehitus
|
||||||
|
fstehitus.ee: ehitus
|
||||||
|
hkehitusekspertiisid.ee: ehitus, ekspert
|
||||||
|
kronestehitus.ee: est, ehitus
|
||||||
|
makeehituspartner.ee: make, ehitus, partner
|
||||||
|
masirent.ee: rent
|
||||||
|
montessorirent.ee: montessoor, rent
|
||||||
|
paadirent1.ee: paadi, rent
|
||||||
|
pakiautorent.ee: paki, auto, rent
|
||||||
|
pixover.ee: pix, over
|
||||||
|
pixrent.ee: pix, rent
|
||||||
|
rentafriend.ee: rent, friend
|
||||||
|
rentbmw.ee: rent, bmw
|
||||||
|
reservrent.ee: reserv, rent
|
||||||
|
rentellix.ee: rent, ellix?
|
||||||
|
valmismajad.ee: valmis, maja
|
||||||
|
eramajadehooldus.ee: eramaja, hooldus
|
||||||
|
mastimajad.ee: mast, maja
|
||||||
|
nupsikpood.ee: nupsik, pood
|
||||||
|
poodcolordeco.ee: pood, color, deco
|
||||||
|
tarantlipood.ee: tarantli, pood
|
||||||
|
alyanstorupood.ee: toru, pood
|
||||||
|
arriumtech.ee: arrium, tech
|
||||||
|
xeniustech.ee: xenius, tech
|
||||||
|
whitechem.ee: white, chem
|
||||||
|
techme.ee: tech, me
|
||||||
|
techcad.ee: tech, cad
|
||||||
|
estonianharbours.ee: estonia, harbour
|
||||||
|
estonianspl.ee: estonia
|
||||||
|
hauratonestonia.ee: hauraton, estonia
|
||||||
|
koerahoidjatartus.ee: koer, hoidja, tartu
|
||||||
|
terrassidtartus.ee: terrass, tartu
|
||||||
|
|
|
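The mechanical boundary rules in the new prompt (drop "www." and the TLD, split on hyphens and underscores, split at letter/digit boundaries and keep digits as separate tokens) can be illustrated outside the model. The sketch below is not part of the commit; `rough_tokens` is a hypothetical helper, it assumes a single-label TLD such as ".ee", and it does not attempt the dictionary-based longest-match step, which the prompt leaves to the model.

```python
import re


def rough_tokens(domain):
    """Pre-split a domain on digit boundaries, hyphens and underscores."""
    name = domain.lower()
    if name.startswith("www."):
        name = name[4:]
    # Drop the last dot-separated label (assumed to be the TLD), if present.
    if "." in name:
        name = name.rsplit(".", 1)[0]
    # Keep runs of letters and runs of digits; everything else is a separator.
    return re.findall(r"[^\W\d_]+|\d+", name)


print(rough_tokens("auto24.ee"))           # ['auto', '24']
print(rough_tokens("www.paadi-rent1.ee"))  # ['paadi', 'rent', '1']
```

A pre-split like this could only serve as a sanity check on the model's "words" output, since fragments such as "autokool" still need the language-aware segmentation described in the prompt.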
@@ -1,11 +1,11 @@
|
||||||
Top 10 most frequent words:
|
Top 10 most frequent words:
|
||||||
1. auto: 71
|
1. auto: 80
|
||||||
2. eesti: 62
|
2. eesti: 65
|
||||||
3. 24: 60
|
3. 24: 62
|
||||||
4. ehitus: 40
|
4. ehitus: 43
|
||||||
5. shop: 33
|
5. rent: 36
|
||||||
6. rent: 33
|
6. shop: 34
|
||||||
7. pood: 28
|
7. estonia: 30
|
||||||
8. estonia: 26
|
8. pood: 27
|
||||||
9. tartu: 24
|
9. tech: 27
|
||||||
10. tech: 23
|
10. tartu: 24
|
||||||
|
|
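A ranking like the one above can be produced by counting tokens across the per-domain "words" arrays. The following is a minimal sketch, not the script from this commit; `top_words` is a hypothetical name, and lower-casing tokens before counting is an assumption made here so that "Auto" and "auto" are counted together.

```python
from collections import Counter


def top_words(word_lists, n=10):
    """Return the n most frequent tokens across all per-domain word lists."""
    counts = Counter(
        word.lower()              # assumption: counting is case-insensitive
        for words in word_lists
        for word in words
    )
    return counts.most_common(n)


sample = [["auto", "24"], ["Auto", "kool"], ["eesti"]]
for rank, (word, count) in enumerate(top_words(sample, 3), start=1):
    print(f"{rank}. {word}: {count}")
```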
Binary file not shown.
Size before: 1.1 MiB. Size after: 1.2 MiB.