upgrade prompt

oleghasjanov 2025-05-29 15:42:12 +03:00
parent aa4d36a0ad
commit 90cafc73c0
4 changed files with 96 additions and 24 deletions

@@ -90,7 +90,7 @@ async def extract_words_with_openai(domain_names, batch_size=BATCH_SIZE):
     client = AsyncOpenAI(api_key=api_key)
     # Get model and temperature from environment variables
-    model = os.environ.get("OPENAI_MODEL", "gpt-4.1-2025-04-14")
+    model = os.environ.get("OPENAI_MODEL", "gpt-4o-2024-11-20")
     temperature = float(os.environ.get("OPENAI_TEMPERATURE", "0"))
     max_tokens = int(os.environ.get("OPENAI_MAX_TOKENS", "16000"))
@@ -104,7 +104,7 @@ async def extract_words_with_openai(domain_names, batch_size=BATCH_SIZE):
     num_batches = (len(filtered_domains) + batch_size - 1) // batch_size
     # Create semaphore to limit concurrent requests
-    semaphore = asyncio.Semaphore(8) # Limit to 5 concurrent requests
+    semaphore = asyncio.Semaphore(10) # Limit to 10 concurrent requests
     async def process_batch(batch_idx):
         async with semaphore:
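
For orientation, the batching pattern touched by this hunk can be sketched as follows. This is a minimal illustration, not the repository's code: `run_batches`, the toy `BATCH_SIZE` value and the `asyncio.sleep(0)` stand-in for the OpenAI request are all assumptions; only the ceiling division and the semaphore-guarded `process_batch` mirror the lines shown in the diff.

```python
import asyncio

BATCH_SIZE = 3  # hypothetical value; the real constant is not shown in the diff


async def run_batches(items, batch_size=BATCH_SIZE, max_concurrency=10):
    """Process items in batches with at most max_concurrency batches in flight."""
    # Ceiling division: the number of batches needed to cover every item.
    num_batches = (len(items) + batch_size - 1) // batch_size
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_batch(batch_idx):
        async with semaphore:  # blocks while max_concurrency batches are running
            start = batch_idx * batch_size
            batch = items[start:start + batch_size]
            await asyncio.sleep(0)  # stand-in for the real API request
            return [name.upper() for name in batch]

    # gather() preserves the order of the submitted coroutines.
    results = await asyncio.gather(*(process_batch(i) for i in range(num_batches)))
    return [item for batch in results for item in batch]


if __name__ == "__main__":
    print(asyncio.run(run_batches(["a.ee", "b.ee", "c.ee", "d.ee"])))
```

Raising the semaphore limit from 8 to 10 only changes how many batches may wait on the API at once; the number of batches is still determined by `batch_size`.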

@@ -1,17 +1,89 @@
-You are a bilinear Estonian-English linguist and word segmentation expert.
-Your task is to identify which word or words a domain name consists of. You only work with English and Estonian words.
-**Key "Language"**:
-You must determine the language of the domain name. The domain name can be a single word or several words. You have 3 options: Estonian, English, Ignore.
-- If the domain consists of numbers, random letters, abbreviations, personal names, or is a transliteration from another language (for example, mnogoknig.ee from Russian), you should choose "Ignore" for Language.
-- If the domain consists of Estonian or English words, set the corresponding value.
-**Key "is_splitted":**
-Here you must specify whether the domain name consists of more than one word. Even if the domain includes an Estonian word and an abbreviation or a number, you still need to set "is_splitted" to true.
-**Key "reasoning":**
-Here, you should reason about which exact words and abbreviations make up the domain name. If the "Language" key is set to Ignore, simply write Ignore. If the "Language" key is either Estonian or English, then write a definition for each word, each abbreviation, and each symbol, explaining what they mean or could mean.
-**Key "words":**
-Based on the reasoning from the previous key, you must write only those words that make up the domain. For example, for auto24.ee, it would be "auto", "24". If the value was Ignore, then you leave the array empty.
+You are a bilinear Estonian-English linguist and word-segmentation expert.
+Your task is to identify which word or words a domain name consists of. You only work with English and Estonian words.
+### INSTRUCTION
+**Key “Language”**
+You must determine the language of the domain name. The domain name can be a single word or several words. You have 3 options: Estonian, English, Ignore.
+- Ignore the protocol, the leading “www.” sub-domain (if present) and the top-level domain (e.g. “.ee”, “.com”) they never influence language detection.
+- If the domain consists of numbers, random letters, abbreviations, personal names, or is a transliteration from another language (for example, mnogoknig.ee from Russian), you should choose “Ignore” for Language.
+- Otherwise, use a longest-match left-to-right lookup against (1) an Estonian core-vocabulary list, (2) a general English dictionary, (3) a whitelist of well-known abbreviations such as BMW, CAD, NGO, AI, EE. Whichever language supplies the majority of matched tokens becomes the value of Language.
+- When tokens from both languages are present in roughly equal measure, choose the language that appears first in the domain string.
+**Key “is_splitted”**
+Here you must specify whether the domain name consists of more than one word.
+- Treat a digit boundary (letter → digit or digit → letter) as an automatic split; the digit itself counts as a separate token.
+- Treat a change of language (Estonian token followed by English token, or vice versa) as a split.
+- Hyphens “-” or underscores “_” (even though rare in .ee domains) are explicit boundaries.
+- Even if the domain includes an Estonian word plus an abbreviation, acronym or number, you still set “is_splitted” to true.
+**Key “reasoning”**
+Here, you should reason about which exact words and abbreviations make up the domain name.
+- Work left → right, applying longest-match dictionary look-ups; if no match is possible and the fragment is ≤ 3 letters, treat it as an abbreviation; if it is longer, treat it as nonsense and set Language = Ignore.
+- When you recognise an Estonian morphological ending (-id, -ed, -us, -ja, -jad, -te), peel it off and explain the root plus ending in the reasoning.
+- If Language is Ignore, simply write “Ignore”. Otherwise, for every recognised word, abbreviation, symbol or number give a short definition or plausible meaning.
+**Key “words”**
+Based on the reasoning above, list only the words and tokens that make up the domain, in the order they appear.
+- Omit “www”, TLDs and any punctuation.
+- Keep digits as separate tokens (e.g. auto24.ee → “auto”, “24”).
+- For fragments treated as abbreviations include the abbreviation exactly as it appears (“BMW”, “CAD”).
+- If Language = Ignore, leave the array empty.
+### EXAMPLES OF SPLITTING WORDS:
+advanceautokool.ee: advance, auto, kool
+1autosuvila.ee: auto, suvila
+autoaks.ee: auto
+autoeis.ee: auto
+autoklaasitehnik.ee: auto, klaas, tehnik
+autokoolmegalinn.ee: auto, kool, mega, linn
+autoly.ee: auto
+automatiseeri.ee: auto
+autonova.ee: auto, nova
+autor.ee: autor
+autost24.ee: Auto, 24
+eestiaiandus.ee: eesti, aiandus
+eestiastelpaju.ee: eesti, astelpaju
+eestiloomekoda.ee: eesti, loomekoda
+eestimadrats.ee: eesti, madrats
+eestiost.ee: eesti, ost
+eestipinglaed.ee: eesti, pinglaed
+eestirohelineelu.ee: eesti, roheline, elu
+eestiterviseuudised.ee: eesti, tervise, uudised
+eheeesti.ee: ehe, eesti
+ehitusliiv.ee: ehitus, liiv
+ehitusgeodeesia.ee: ehitus, geodeesia
+ehitusakadeemia.ee: ehitus, akadeemia
+ehitusoutlet1.ee: ehitus, outlet
+enpeehitus.ee: ehitus
+eramuteehitus.ee: eramu, ehitus
+fstehitus.ee: ehitus
+hkehitusekspertiisid.ee: ehitus, ekspert
+kronestehitus.ee: est, ehitus
+makeehituspartner.ee: make, ehitus, partner
+masirent.ee: rent
+montessorirent.ee: montessoor, rent
+paadirent1.ee: paadi, rent
+pakiautorent.ee: paki, auto, rent
+pixover.ee: pix, over
+pixrent.ee: pix, rent
+rentafriend.ee: rent, friend
+rentbmw.ee: rent, bmw
+reservrent.ee: reserv, rent
+rentellix.ee: rent, ellix?
+valmismajad.ee: valmis, maja
+eramajadehooldus.ee: eramaja, hooldus
+mastimajad.ee: mast, maja
+nupsikpood.ee: nupsik, pood
+poodcolordeco.ee: pood, color, deco
+tarantlipood.ee: tarantli, pood
+alyanstorupood.ee: toru, pood
+arriumtech.ee: arrium, tech
+xeniustech.ee: xenius, tech
+whitechem.ee: white, chem
+techme.ee: tech, me
+techcad.ee: tech, cad
+estonianharbours.ee: estonia, harbour
+estonianspl.ee: estonia
+hauratonestonia.ee: hauraton, estonia
+koerahoidjatartus.ee: koer, hoidja, tartu
+terrassidtartus.ee: terrass, tartu
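
The segmentation rules the new prompt states in prose (strip protocol, "www." and TLD; split at digit boundaries, hyphens and underscores; longest-match left-to-right lookup; unmatched fragments of up to 3 letters count as abbreviations, longer ones mean Ignore) can be approximated deterministically. The sketch below is an illustration under assumptions, not part of the commit: `VOCAB` is a toy word list, and `strip_domain`, `split_letters` and `segment` are hypothetical names. It follows the written rules, so it keeps digits as tokens even though some of the examples above (e.g. 1autosuvila.ee) omit them, and it does not attempt language detection or the peeling of Estonian endings.

```python
import re

# Toy vocabulary for illustration only; the prompt refers to full Estonian and English word lists.
VOCAB = {"auto", "kool", "mega", "linn", "rent", "ehitus", "eesti", "pood", "tech"}
MAX_ABBREVIATION_LEN = 3  # unmatched fragments up to this length count as abbreviations


def strip_domain(domain):
    """Drop the protocol, a leading 'www.' and the top-level domain."""
    name = re.sub(r"^[a-z]+://", "", domain.lower())
    name = re.sub(r"^www\.", "", name)
    return name.rsplit(".", 1)[0]


def split_letters(chunk, vocab):
    """Greedy longest-match split of a letters-only chunk; None means 'Ignore'."""
    tokens, unmatched, pos = [], "", 0
    while pos < len(chunk):
        # Try the longest candidate starting at pos first.
        match = next(
            (chunk[pos:end] for end in range(len(chunk), pos, -1) if chunk[pos:end] in vocab),
            None,
        )
        if match is None:
            unmatched += chunk[pos]  # collect characters no vocabulary word starts at
            pos += 1
            continue
        if len(unmatched) > MAX_ABBREVIATION_LEN:
            return None
        if unmatched:
            tokens.append(unmatched)  # short unmatched fragment: treat as abbreviation
            unmatched = ""
        tokens.append(match)
        pos += len(match)
    if len(unmatched) > MAX_ABBREVIATION_LEN:
        return None
    if unmatched:
        tokens.append(unmatched)
    return tokens


def segment(domain, vocab=VOCAB):
    """Return the token list for a domain, or [] when it should be ignored."""
    tokens = []
    # Letter runs and digit runs are separate chunks; '-', '_' and '.' act as boundaries.
    for chunk in re.findall(r"[^\W\d_]+|\d+", strip_domain(domain)):
        if chunk.isdigit():
            tokens.append(chunk)
            continue
        parts = split_letters(chunk, vocab)
        if parts is None:
            return []
        tokens.extend(parts)
    return tokens
```

With this toy vocabulary, `segment("auto24.ee")` yields `["auto", "24"]`, `segment("rentbmw.ee")` yields `["rent", "bmw"]`, and `segment("https://www.mnogoknig.ee")` yields `[]`. Greedy longest-match can mis-split when a longer word hides a shorter intended one, which is one reason the commit delegates the final decision to the model rather than to a fixed lookup.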

@@ -1,11 +1,11 @@
 Top 10 most frequent words:
-1. auto: 71
-2. eesti: 62
-3. 24: 60
-4. ehitus: 40
-5. shop: 33
-6. rent: 33
-7. pood: 28
-8. estonia: 26
-9. tartu: 24
-10. tech: 23
+1. auto: 80
+2. eesti: 65
+3. 24: 62
+4. ehitus: 43
+5. rent: 36
+6. shop: 34
+7. estonia: 30
+8. pood: 27
+9. tech: 27
+10. tartu: 24
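
A report in this shape could be produced from the per-domain word lists roughly as follows. This is a hedged sketch: the function names `top_words` and `format_report` are hypothetical, and lowercasing tokens before counting is an assumption (the diff does not show how the repository normalizes case).

```python
from collections import Counter


def top_words(word_lists, n=10):
    """Count tokens across all domains and return the n most frequent."""
    # Lowercasing is an assumption so that "Auto" and "auto" are counted together.
    counts = Counter(word.lower() for words in word_lists for word in words)
    return counts.most_common(n)


def format_report(ranked):
    """Render (word, count) pairs in the same layout as the report above."""
    lines = [f"Top {len(ranked)} most frequent words:"]
    lines += [f"{rank}. {word}: {count}" for rank, (word, count) in enumerate(ranked, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [["auto", "24"], ["Auto", "24"], ["auto"]]
    print(format_report(top_words(sample, n=2)))
```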

Binary file not shown. Image size before: 1.1 MiB; after: 1.2 MiB.