|
8601
|
167
|
8
|
2026-04-14T06:55:28.298806+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149728298_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.sidemne.minny.com/dl-teoorls~ Google Gemini.. X•< 40 Wl © | Daily -Platform • 10mleft A 100%C &• Tue 14 Apr 9:55:27Platform Sprint 1 Q2 - Platform Tea@ Jy 19798 evaluation for ai activity t8 Jiminny+ New labsocc ce Cdcseciie3. Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoeses.Al Reports• Report nameNAME -Ask Jiminny Test Report - 8 Apr 2026S03 Ask Jiminny reportsPeriodiReport Type© Clear all |10FREQUENCY+DailySHAREDICCanDATEY09/04/9016ACTIONSP Croare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await sasession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio. Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne about peodleYour privacy & GeminiSummarize pageGapp.staging.jiminny.com/a...
|
NULL
|
3214708001850927741
|
NULL
|
visual_change
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.sidemne.minny.com/dl-teoorls~ Google Gemini.. X•< 40 Wl © | Daily -Platform • 10mleft A 100%C &• Tue 14 Apr 9:55:27Platform Sprint 1 Q2 - Platform Tea@ Jy 19798 evaluation for ai activity t8 Jiminny+ New labsocc ce Cdcseciie3. Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoeses.Al Reports• Report nameNAME -Ask Jiminny Test Report - 8 Apr 2026S03 Ask Jiminny reportsPeriodiReport Type© Clear all |10FREQUENCY+DailySHAREDICCanDATEY09/04/9016ACTIONSP Croare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await sasession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio. Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne about peodleYour privacy & GeminiSummarize pageGapp.staging.jiminny.com/a...
|
NULL
|
|
8600
|
167
|
7
|
2026-04-14T06:55:22.230352+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149722230_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.sidemne.minny.com/dl-teoorls~ Google Gemini.. XC< 40Wl © | Daily - Platform • 10mleft A 100%C &• Tue 14 Apr 9:55:21Platform Sprint 1 Q2 - Platform Tea@ Jy 19798 evaluation for ai activity t8 Jiminny+ New labsocc ce Cdcseciie3. Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoeses.Al Reports• Report nameNAME -Ask Jiminny Test Report - 8 Apr 20263 Ask Jiminny reportsPeriodiReport Type10© Clear allFREQUENCYDailySHAREDLKamDATEY09/04/9016ACTIONS• Co dare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await sasession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio. Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiPro vYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageJY-18909-automated-reports-ask-iiminny = 869453)...
|
NULL
|
-4633704651421981136
|
NULL
|
visual_change
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.sidemne.minny.com/dl-teoorls~ Google Gemini.. XC< 40Wl © | Daily - Platform • 10mleft A 100%C &• Tue 14 Apr 9:55:21Platform Sprint 1 Q2 - Platform Tea@ Jy 19798 evaluation for ai activity t8 Jiminny+ New labsocc ce Cdcseciie3. Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoeses.Al Reports• Report nameNAME -Ask Jiminny Test Report - 8 Apr 20263 Ask Jiminny reportsPeriodiReport Type10© Clear allFREQUENCYDailySHAREDLKamDATEY09/04/9016ACTIONS• Co dare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await sasession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio. Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiPro vYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageJY-18909-automated-reports-ask-iiminny = 869453)...
|
8599
|
|
8599
|
167
|
6
|
2026-04-14T06:55:19.229887+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149719229_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.sidemne.minny.com/dl-teoorls~ Google Gemini.. X•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:18Platform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t•Jiminny+ New labAl Reports• Report nameNAME €S03 Ask Jiminny reportsPeriodiReport Type© Clear all |socc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringolcses.FREQUENCY #DATE €ACTIONSI10are there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio. Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageapp.staging.fiminnv.com/ai-rgports k-iiminny - 869453...
|
NULL
|
1743723797016129039
|
NULL
|
visual_change
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.sidemne.minny.com/dl-teoorls~ Google Gemini.. X•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:18Platform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t•Jiminny+ New labAl Reports• Report nameNAME €S03 Ask Jiminny reportsPeriodiReport Type© Clear all |socc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringolcses.FREQUENCY #DATE €ACTIONSI10are there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio. Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageapp.staging.fiminnv.com/ai-rgports k-iiminny - 869453...
|
NULL
|
|
8598
|
167
|
5
|
2026-04-14T06:55:17.879180+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149717879_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesWindowH FirefoxFileEoitViewHistoryBookmarksProfilesWindowHelpPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t8 Jiminny+ New labToolsdoo.sideme.minny.com/casnooara~ Google Gemini•. Xsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringOesesMy Recordings Everyone's RecordingsTrending this monthSortby: Most played< 40 ll 1 Daily - Platform • 10m leftA100% C• 8 • Tue 14 Apr 9:55:17Live FeedD10Unknown Customer MNotetaker added by Veselin Kulovvinmes roedNo RecordingsScheduleThis WeekEveryone's Scheduleare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await sasession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiPro vYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageAPUnknown CustomerRefinement - ProcessingWednesday, 12:30 AMUnknown CustomerProcessing tickets reviewWednesday, 4:30 AMRobinson Crusoe Cruises Limited ®Sprint ReviewClosed LostWednesday, 6:00 AMInvite Notetaker::Veselin Kulov listened to call 60activity with unknown customerf Held: 7 Apr, 2:36 PMỠ Duration: 7mNikolay Nikolov listened to call 60Web Demo with Martin Petkovđ Held: 13 Feb, 5:15 AM• Duration: 11mlNikolay Yankov listened to call 6дactivity with Nikolay Yankov8 Held: 12 Feb. 12:42 AMỞ Duration: 1mNikolay Yankov listened to call 6aactivity with Nikolay YankovHeld: 12 Feb, 12:42 AMỞ Duration: 1mMartin Petkov listened to call 6dDiscovery with Martin PetkovHeld: 21 Jan, 7:00 AMỞ Duration: 35mMartin Petkov listened to call 6dactivity with Jonathan HeatonE Held: 25 Sep, 2024, 3:33 AMỞ Duration: 37m7 Apr, 2:51 PMlolo lo lo lo loGalya Dimitrova listened to call 6д26 Nov, 2025, 4:29 AMactivity with David EhrlichE Held: 9 Oct, 2024, 5:31 AM• Duration: 39m≥ Value: $23,000Aneliya Angelova listened to call 6д18 Aug, 2025, 10:45 AMCold call with Robinson Crusoe New 13333f Held: 23 Jul, 2025, 7:43 AM• Duration: 2ml§ Value: $5Jiminny Web SA shared in Slack &23 Jul, 2025, 7:48 AMCold call with Robinson Crusoe New 13333E Held: 23 Jul, 2025, 7:43 AM|Duration: 2m≤ Value: $5JY-18909-automated-reports-ask-jiminny = 869453)...
|
NULL
|
4604028648177887887
|
NULL
|
click
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesWindowH FirefoxFileEoitViewHistoryBookmarksProfilesWindowHelpPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t8 Jiminny+ New labToolsdoo.sideme.minny.com/casnooara~ Google Gemini•. Xsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringOesesMy Recordings Everyone's RecordingsTrending this monthSortby: Most played< 40 ll 1 Daily - Platform • 10m leftA100% C• 8 • Tue 14 Apr 9:55:17Live FeedD10Unknown Customer MNotetaker added by Veselin Kulovvinmes roedNo RecordingsScheduleThis WeekEveryone's Scheduleare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await sasession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiPro vYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageAPUnknown CustomerRefinement - ProcessingWednesday, 12:30 AMUnknown CustomerProcessing tickets reviewWednesday, 4:30 AMRobinson Crusoe Cruises Limited ®Sprint ReviewClosed LostWednesday, 6:00 AMInvite Notetaker::Veselin Kulov listened to call 60activity with unknown customerf Held: 7 Apr, 2:36 PMỠ Duration: 7mNikolay Nikolov listened to call 60Web Demo with Martin Petkovđ Held: 13 Feb, 5:15 AM• Duration: 11mlNikolay Yankov listened to call 6дactivity with Nikolay Yankov8 Held: 12 Feb. 12:42 AMỞ Duration: 1mNikolay Yankov listened to call 6aactivity with Nikolay YankovHeld: 12 Feb, 12:42 AMỞ Duration: 1mMartin Petkov listened to call 6dDiscovery with Martin PetkovHeld: 21 Jan, 7:00 AMỞ Duration: 35mMartin Petkov listened to call 6dactivity with Jonathan HeatonE Held: 25 Sep, 2024, 3:33 AMỞ Duration: 37m7 Apr, 2:51 PMlolo lo lo lo loGalya Dimitrova listened to call 6д26 Nov, 2025, 4:29 AMactivity with David EhrlichE Held: 9 Oct, 2024, 5:31 AM• Duration: 39m≥ Value: $23,000Aneliya Angelova listened to call 6д18 Aug, 2025, 10:45 AMCold call with Robinson Crusoe New 13333f Held: 23 Jul, 2025, 7:43 AM• Duration: 2ml§ Value: $5Jiminny Web SA shared in Slack &23 Jul, 2025, 7:48 AMCold call with Robinson Crusoe New 13333E Held: 23 Jul, 2025, 7:43 AM|Duration: 2m≤ Value: $5JY-18909-automated-reports-ask-jiminny = 869453)...
|
8596
|
|
8596
|
167
|
4
|
2026-04-14T06:55:16.036688+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149716036_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesWindowH FirefoxFileEoitViewHistoryBookmarksProfilesWindowHelpPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t*Jiminny+ New labToolsdoo.sideme.minny.com/casnooara~ Google Gemini•. Xsoccce cdcselnde.3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringOeses< 40 ll1 Daily - Platform • 10m left100% CD8 • Tue 14 Apr 9:55:15My RecordingsEveryone's RecordingsTrending this monthLive FeedSNo RecordingsScheduleThis WeekEveryone's Scheduleare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiPro vYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageUnknown CustomerRefinement - Processing|Wednesday, 12:30 AM|Uпknown customerProcessing tickets reviewWednesday, 4:30 AM|Robinson frusoe Cruises Limited ®Sprint RevievClosed LostWednesday, 6:00 AM|Veselin Kulov listened to call 6d/ADr. 2.51 PMIactivity with unknown customerf Held: 7 Apr, 2:36 PM|•) Duration: 7ml£ Value: $0Nikolay Nikolov listened to call 6d27 Feb, 7:38 AMWeb Demo with Martin PetkovHeld: 13 Feb, 5:15 AM© Duration: 11ml§ Value: $11Nikolay Yankov listened to call 6d12 Feb, 1:07 AMacuivicy wicniNikolay тапкоvHe Held: T7 Feh. "7:47 AMI@ Duration: 1ml≥ Value: $0Nikolay Yankov listened to call 6d12 Feb, 1:06 AMaccivity with Nikolav YankovIEB Held: 17 Feb. 17:42 AMI© Duration: 1m l≥ Value: $01Martin Petkov listened to call 6d22 Jan, 4:30 AM|Discovery with Martin Petkov|F Held: 21 Jan, 7:00 AM|Ô Duration: 35m≥ Value: $21Martin Petkov listened to call 6dZU Jan. 24ZAMIactivity with Jonathan HeatonE Held: 25 Sep, 2024, 3:33 AM© Duration: 37ml& Value: £8,886,Galya Dimitrova listened to call 6dZ6 Nov. 2025,4:29 AMIactivity with David EhrlichE Held: 9 Oct, 2024, 5:31 AM|• Duration: 39m≥ Value: $23,000)Aneliya Angelova listened to call 6д18 Auo. 7075. 10'45 AMICold call with Robinson Crusoe New 13333f Held: 23 Jul, 2025, 7:43 AM|© Duration: 2m& Value: $5,Jiminny Web SA shared in Slack 2Cold call with Robinson Crusoe New 13333|Held: 23 Jul, 2025, 7:43 AM@ Duration: 2mllia.app.staging.jiminny.com…....
|
NULL
|
7599185317630420426
|
NULL
|
visual_change
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesWindowH FirefoxFileEoitViewHistoryBookmarksProfilesWindowHelpPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t*Jiminny+ New labToolsdoo.sideme.minny.com/casnooara~ Google Gemini•. Xsoccce cdcselnde.3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringOeses< 40 ll1 Daily - Platform • 10m left100% CD8 • Tue 14 Apr 9:55:15My RecordingsEveryone's RecordingsTrending this monthLive FeedSNo RecordingsScheduleThis WeekEveryone's Scheduleare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiPro vYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageUnknown CustomerRefinement - Processing|Wednesday, 12:30 AM|Uпknown customerProcessing tickets reviewWednesday, 4:30 AM|Robinson frusoe Cruises Limited ®Sprint RevievClosed LostWednesday, 6:00 AM|Veselin Kulov listened to call 6d/ADr. 2.51 PMIactivity with unknown customerf Held: 7 Apr, 2:36 PM|•) Duration: 7ml£ Value: $0Nikolay Nikolov listened to call 6d27 Feb, 7:38 AMWeb Demo with Martin PetkovHeld: 13 Feb, 5:15 AM© Duration: 11ml§ Value: $11Nikolay Yankov listened to call 6d12 Feb, 1:07 AMacuivicy wicniNikolay тапкоvHe Held: T7 Feh. "7:47 AMI@ Duration: 1ml≥ Value: $0Nikolay Yankov listened to call 6d12 Feb, 1:06 AMaccivity with Nikolav YankovIEB Held: 17 Feb. 17:42 AMI© Duration: 1m l≥ Value: $01Martin Petkov listened to call 6d22 Jan, 4:30 AM|Discovery with Martin Petkov|F Held: 21 Jan, 7:00 AM|Ô Duration: 35m≥ Value: $21Martin Petkov listened to call 6dZU Jan. 24ZAMIactivity with Jonathan HeatonE Held: 25 Sep, 2024, 3:33 AM© Duration: 37ml& Value: £8,886,Galya Dimitrova listened to call 6dZ6 Nov. 2025,4:29 AMIactivity with David EhrlichE Held: 9 Oct, 2024, 5:31 AM|• Duration: 39m≥ Value: $23,000)Aneliya Angelova listened to call 6д18 Auo. 7075. 10'45 AMICold call with Robinson Crusoe New 13333f Held: 23 Jul, 2025, 7:43 AM|© Duration: 2m& Value: $5,Jiminny Web SA shared in Slack 2Cold call with Robinson Crusoe New 13333|Held: 23 Jul, 2025, 7:43 AM@ Duration: 2mllia.app.staging.jiminny.com…....
|
NULL
|
|
8595
|
167
|
3
|
2026-04-14T06:55:13.024060+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149713024_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.siaemne.minny.com~ Google Gemini.. XPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity tJiminny+ New labsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoesesare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageJIMINNY•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:12Sign InSign in with GoogleSign in with Office 365Login via SSOEUWorldwideDon't have account yet? Speak to your sales leader to get an invite.By using this service you agree to the Privacy Policy and Terms of Service.TERMSKNOWLEDGE BASE BLOG...
|
NULL
|
47346233358813018
|
NULL
|
visual_change
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.siaemne.minny.com~ Google Gemini.. XPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity tJiminny+ New labsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoesesare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageJIMINNY•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:12Sign InSign in with GoogleSign in with Office 365Login via SSOEUWorldwideDon't have account yet? Speak to your sales leader to get an invite.By using this service you agree to the Privacy Policy and Terms of Service.TERMSKNOWLEDGE BASE BLOG...
|
8594
|
|
8594
|
167
|
2
|
2026-04-14T06:55:10.912813+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149710912_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.siaemne.minny.com~ Google Gemini.. XPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t• Jiminny+ New labsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoesesare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize page•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:10Sign InSign in with ffoogleSign in with Office 365Login via SSOEUWorldwideDon't have account yet? Speak to your sales leader to get an invite.By using this service you agree to the Privacy Policy and Terms of Service.TERMS/KNOWLEDGE BASE BLOG...
|
NULL
|
-1239825134536860121
|
NULL
|
click
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.siaemne.minny.com~ Google Gemini.. XPlatform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity t• Jiminny+ New labsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoesesare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize page•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:10Sign InSign in with ffoogleSign in with Office 365Login via SSOEUWorldwideDon't have account yet? Speak to your sales leader to get an invite.By using this service you agree to the Privacy Policy and Terms of Service.TERMS/KNOWLEDGE BASE BLOG...
|
NULL
|
|
8592
|
167
|
1
|
2026-04-14T06:55:09.977770+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149709977_m2.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.siaemne.minny.com~ Google Gemini.. X•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:09Platform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity tJiminny+ New labsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoesesare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiSign InSign in with ffoogleSign in with Office 365Login via SSOEUWorldwideDon't have account yet? Speak to your sales leader to get an invite.By using this service you agree to the Privacy Policy and Terms of Service.ProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageTERMSKNOWLEDGE BASE BLOGrect/google?redirectUrl=...
|
NULL
|
-5831688426993136130
|
NULL
|
visual_change
|
ocr
|
NULL
|
FirefoxFileEoitViewHistoryBookmarksProfilesToolsWi FirefoxFileEoitViewHistoryBookmarksProfilesToolsWindow Helpdoo.siaemne.minny.com~ Google Gemini.. X•< 40 ll © | Daily - Platform • 10m leftA 100%C &• Tue 14 Apr 9:55:09Platform Sprint 1 Q2 - Platform TeaJy 19798 evaluation for ai activity tJiminny+ New labsocc ce Cdcseciie3.Run: Executes the evaluation viaasyncio.run(run_evaluation(...)),which orchestrates the collection,endpoint calling, and result scoringoesesare there soem issueswith this PRShow thinking vYes, there are several issues and code smellsin this Pull Request, ranging from performancebottlenecks to architectural anti-patterns.Here is a breakdown of the most notableissues, categorized by impact:Performance & Async Anti-Patterns• Sequential Requests Defeating Async(Lines 86-106): Inside_collect_from_es, the code loopsthrough raw_samples and await s asession.post request for each sampleone by one. This means the script waitsfor one HTTP request to finish beforestarting the next, completely defeatingthe purpose of using aiohttp forasynchronous concurrency.• Fix: Collect these coroutines into alist and use asyncio.gather ()(ideally wrapped in anasyncio.Semaphore to avoidoverwhelming the endpoint) to run• Enter a prompt for GeminiSign InSign in with ffoogleSign in with Office 365Login via SSOEUWorldwideDon't have account yet? Speak to your sales leader to get an invite.By using this service you agree to the Privacy Policy and Terms of Service.ProvYour Jiminny chats aren't used to improve our models. Gemini is Alane can make mistakes, Inciualne aoout peodleYour privacy & GeminiSummarize pageTERMSKNOWLEDGE BASE BLOGrect/google?redirectUrl=...
|
8591
|
|
8591
|
167
|
0
|
2026-04-14T06:55:06.977785+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149706977_m2.jpg...
|
Firefox
|
Work — Mozilla Firefox
|
1
|
app.staging.jiminny.com
|
monitor_2
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
New Tab
New Tab
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to
0
, the
if min_size
short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.
Potential Path Traversal Risk (Lines 110, 113, 251):
The script writes and reads files directly using
f"{es_id}.json"
and
f"gt_{req_id}.json"
. If
es_id
or
req_id
originate from a database and ever contain characters like
../
, it could result in files being read from or written to unintended directories.
Fix:
Sanitize the IDs or strictly enforce that they only contain alphanumeric characters....
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"bounds":{"left":0.00234375,"top":0.06458333,"width":0.0890625,"height":0.028472222},"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"bounds":{"left":0.0,"top":0.10208333,"width":0.09375,"height":0.028472222},"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"bounds":{"left":0.015625,"top":0.11180556,"width":0.11796875,"height":0.009722223},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"bounds":{"left":0.0,"top":0.13055556,"width":0.09375,"height":0.028472222},"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"bounds":{"left":0.015625,"top":0.14027777,"width":0.1984375,"height":0.009722223},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"New Tab","depth":4,"bounds":{"left":0.0,"top":0.15902779,"width":0.09375,"height":0.028472222},"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"New Tab","depth":5,"bounds":{"left":0.015625,"top":0.16875,"width":0.017578125,"height":0.009722223},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"bounds":{"left":0.07890625,"top":0.16527778,"width":0.009375,"height":0.016666668},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"bounds":{"left":0.003125,"top":0.18888889,"width":0.08710937,"height":0.022222223},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"bounds":{"left":0.003125,"top":0.97430557,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"bounds":{"left":0.01640625,"top":0.97430557,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"bounds":{"left":0.029296875,"top":0.97430557,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"bounds":{"left":0.0421875,"top":0.97430557,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"bounds":{"left":0.05546875,"top":0.97430557,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"bounds":{"left":0.2171875,"top":0.06736111,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"bounds":{"left":0.23125,"top":0.06736111,"width":0.0125,"height":0.022222223},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"bounds":{"left":0.228125,"top":0.10972222,"width":0.015625,"height":0.027777778},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"bounds":{"left":0.0984375,"top":0.10972222,"width":0.015625,"height":0.027777778},"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"bounds":{"left":0.1953125,"top":0.10972222,"width":0.015625,"height":0.027777778},"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"bounds":{"left":0.2109375,"top":0.10972222,"width":0.015625,"height":0.027777778},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"bounds":{"left":0.09335937,"top":0.14791666,"width":0.000390625,"height":0.00069444446},"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"bounds":{"left":0.09335937,"top":0.15,"width":0.14101562,"height":0.022222223},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"bounds":{"left":0.22070312,"top":0.0,"width":0.015625,"height":0.027777778},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"bounds":{"left":0.103125,"top":0.004166667,"width":0.13320312,"height":0.016666668},"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"bounds":{"left":0.103125,"top":0.0055555557,"width":0.060546875,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Parsing:","depth":26,"bounds":{"left":0.1171875,"top":0.028472222,"width":0.024609376,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script uses","depth":26,"bounds":{"left":0.14179687,"top":0.028472222,"width":0.04609375,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"create_parser","depth":27,"bounds":{"left":0.19023438,"top":0.029861111,"width":0.042578124,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to handle CLI arguments like","depth":26,"bounds":{"left":0.1171875,"top":0.028472222,"width":0.119140625,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--data-set","depth":27,"bounds":{"left":0.1171875,"top":0.047916666,"width":0.107421875,"height":0.03125},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"bounds":{"left":0.12929687,"top":0.06458333,"width":0.003125,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--populate-ground-truth-llm","depth":27,"bounds":{"left":0.13476562,"top":0.065972224,"width":0.08867188,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", and","depth":26,"bounds":{"left":0.1171875,"top":0.06458333,"width":0.11171875,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--compare-to-ground-truth","depth":27,"bounds":{"left":0.13203125,"top":0.084027775,"width":0.08203125,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"bounds":{"left":0.21640626,"top":0.08263889,"width":0.0015625,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Initialization:","depth":26,"bounds":{"left":0.1171875,"top":0.10625,"width":0.039453126,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sets up directory paths for inputs and ground truths based on the selected dataset name.","depth":26,"bounds":{"left":0.1171875,"top":0.10625,"width":0.11875,"height":0.050694443},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Run:","depth":26,"bounds":{"left":0.1171875,"top":0.16597222,"width":0.013671875,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Executes the evaluation via","depth":26,"bounds":{"left":0.13085938,"top":0.16597222,"width":0.080078125,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.run(run_evaluation(...))","depth":27,"bounds":{"left":0.11953125,"top":0.18541667,"width":0.10507812,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", which orchestrates the collection, endpoint calling, and result scoring phases.","depth":26,"bounds":{"left":0.1171875,"top":0.18402778,"width":0.11289062,"height":0.06875},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"bounds":{"left":0.1203125,"top":0.3013889,"width":0.015625,"height":0.027777778},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Edit","depth":21,"bounds":{"left":0.1375,"top":0.3013889,"width":0.015625,"height":0.027777778},"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said are there soem issues with this PR","depth":21,"bounds":{"left":0.1609375,"top":0.30972221,"width":0.072265625,"height":0.03888889},"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"bounds":{"left":0.09335937,"top":0.31180555,"width":0.0234375,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"are there soem issues with this PR","depth":23,"bounds":{"left":0.1609375,"top":0.3125,"width":0.062109374,"height":0.034027778},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Listen","depth":22,"bounds":{"left":0.22382812,"top":0.37638888,"width":0.015625,"height":0.027777778},"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"bounds":{"left":0.22382812,"top":0.3715278,"width":0.015625,"height":0.027777778},"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Show thinking","depth":26,"bounds":{"left":0.11953125,"top":0.3784722,"width":0.036328126,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Gemini said","depth":20,"bounds":{"left":0.11679687,"top":0.41458333,"width":0.000390625,"height":0.00069444446},"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"bounds":{"left":0.11679687,"top":0.41597223,"width":0.048046876,"height":0.016666668},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.","depth":24,"bounds":{"left":0.10390625,"top":0.42291668,"width":0.13242188,"height":0.050694443},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Here is a breakdown of the most notable issues, categorized by impact:","depth":24,"bounds":{"left":0.10390625,"top":0.48819444,"width":0.11640625,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Performance & Async Anti-Patterns","depth":23,"bounds":{"left":0.10390625,"top":0.5416667,"width":0.13242188,"height":0.016666668},"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Performance & Async Anti-Patterns","depth":24,"bounds":{"left":0.10390625,"top":0.54305553,"width":0.10859375,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sequential Requests Defeating Async (Lines 86-106):","depth":26,"bounds":{"left":0.11757813,"top":0.5659722,"width":0.11601563,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inside","depth":26,"bounds":{"left":0.16289063,"top":0.58402777,"width":0.019921875,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":27,"bounds":{"left":0.11992188,"top":0.60347223,"width":0.05234375,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the code loops through","depth":26,"bounds":{"left":0.11757813,"top":0.6020833,"width":0.10429688,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"raw_samples","depth":27,"bounds":{"left":0.14414063,"top":0.6215278,"width":0.0359375,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"bounds":{"left":0.18242188,"top":0.6201389,"width":0.0140625,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"await","depth":27,"bounds":{"left":0.19882813,"top":0.6215278,"width":0.01640625,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"s a","depth":26,"bounds":{"left":0.21757813,"top":0.6201389,"width":0.009765625,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"session.post","depth":27,"bounds":{"left":0.11992188,"top":0.63958335,"width":0.039453126,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using","depth":26,"bounds":{"left":0.11757813,"top":0.63819444,"width":0.11640625,"height":0.08680555},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp","depth":27,"bounds":{"left":0.18046875,"top":0.7118056,"width":0.023046875,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for asynchronous concurrency.","depth":26,"bounds":{"left":0.11757813,"top":0.7104167,"width":0.09921875,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"bounds":{"left":0.1296875,"top":0.75208336,"width":0.009765625,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Collect these coroutines into a list and use","depth":28,"bounds":{"left":0.1296875,"top":0.75208336,"width":0.09921875,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.gather()","depth":29,"bounds":{"left":0.16601562,"top":0.77152777,"width":0.05234375,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(ideally wrapped in an","depth":28,"bounds":{"left":0.1296875,"top":0.77013886,"width":0.09257813,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.Semaphore","depth":29,"bounds":{"left":0.13203125,"top":0.8076389,"width":0.055859376,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to avoid overwhelming the endpoint) to run the replayability probes concurrently.","depth":28,"bounds":{"left":0.1296875,"top":0.80625,"width":0.10546875,"height":0.050694443},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):","depth":26,"bounds":{"left":0.11757813,"top":0.8659722,"width":0.10976563,"height":0.03263889},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Standard synchronous file operations (","depth":26,"bounds":{"left":0.11757813,"top":0.8840278,"width":0.107421875,"height":0.050694443},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"with open(...)","depth":27,"bounds":{"left":0.12226562,"top":0.9215278,"width":0.045703124,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"bounds":{"left":0.1703125,"top":0.9201389,"width":0.003515625,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.dump","depth":27,"bounds":{"left":0.17617187,"top":0.9215278,"width":0.029296875,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"bounds":{"left":0.2078125,"top":0.9201389,"width":0.003125,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.load","depth":27,"bounds":{"left":0.11992188,"top":0.93958336,"width":0.0296875,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") are used inside","depth":26,"bounds":{"left":0.15195313,"top":0.93819445,"width":0.048046876,"height":0.014583333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"async def","depth":27,"bounds":{"left":0.20234375,"top":0.93958336,"width":0.0296875,"height":0.013194445},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"functions. In Python, this blocks the entire event loop while the disk is reading/writing.","depth":26,"bounds":{"left":0.11757813,"top":0.93819445,"width":0.11875,"height":0.061805546},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"bounds":{"left":0.1296875,"top":1.0,"width":0.009765625,"height":-0.015972257},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Use an asynchronous file I/O library like","depth":28,"bounds":{"left":0.1296875,"top":1.0,"width":0.09335937,"height":-0.015972257},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiofiles","depth":29,"bounds":{"left":0.1625,"top":1.0,"width":0.0265625,"height":-0.035416722},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", or run the blocking I/O operations in a separate thread using","depth":28,"bounds":{"left":0.1296875,"top":1.0,"width":0.10585938,"height":-0.034027815},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.to_thread()","depth":29,"bounds":{"left":0.16914062,"top":1.0,"width":0.0625,"height":-0.07152772},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.23398438,"top":1.0,"width":0.0015625,"height":-0.07013893},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Architecture & State Management","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Architecture & State Management","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Mutating Class-Level Variables (Lines 323-328):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script parses CLI arguments and directly mutates the class attributes (e.g.,","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator.DATA_SET = args.data_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Pass these arguments into the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"__init__","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"constructor of the instance (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"self.data_set = args.data_set","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") rather than modifying the class definition globally.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Code Quality & Edge Cases","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Code Quality & Edge Cases","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Broad Exception Catching (Line 107):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Using a bare","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except Exception:","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"during the replayability probe is dangerous. It will catch","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp.ClientError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(which is intended), but it will also silently swallow","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"TimeoutError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or even memory errors, making debugging incredibly difficult if the endpoint goes down.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Scope the exception to specific expected errors, like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except (aiohttp.ClientError, asyncio.TimeoutError):","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inefficient Size Calculation (Line 95):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"len(json.dumps(request_body)) < min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"requires serializing the entire dictionary to a string","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"just","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to check its length. Since","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"defaults to","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"if min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Potential Path Traversal Risk (Lines 110, 113, 251):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script writes and reads files directly using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"f\"{es_id}.json\"","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"f\"gt_{req_id}.json\"","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":". If","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es_id","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"req_id","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"originate from a database and ever contain characters like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"../","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", it could result in files being read from or written to unintended directories.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sanitize the IDs or strictly enforce that they only contain alphanumeric characters.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"}]...
|
4930158969472011583
|
8732321575041020375
|
visual_change
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
New Tab
New Tab
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to
0
, the
if min_size
short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.
Potential Path Traversal Risk (Lines 110, 113, 251):
The script writes and reads files directly using
f"{es_id}.json"
and
f"gt_{req_id}.json"
. If
es_id
or
req_id
originate from a database and ever contain characters like
../
, it could result in files being read from or written to unintended directories.
Fix:
Sanitize the IDs or strictly enforce that they only contain alphanumeric characters....
|
NULL
|
|
8687
|
166
|
41
|
2026-04-14T06:59:57.516101+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149997516_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFileEditViewTabsBookmarksHistoryExtensionsWindo DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)jminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme._.Marsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsC SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, AnvE SSHEl DutadogActionsCD WikiSecurity and qualityk Insights@ Settings|Fix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankor• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issues• COstI CHO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - PlatformC Ct: Depe• Actions s• DependelD TиsksInsights & Coachin.• Dev- UXQ Type ([] to searchQ Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)Daily - Platform - 6 m left100% [8• Tue 14 Apr 9:59:57ChatO Far DepeEA oootnrtStefka StoyanovaNikolay YankovSteliyan GeorgievNikolay IvanovAneliya Angelova14:32Lukas Kovalik...
|
NULL
|
3734174758061693885
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFileEditViewTabsBookmarksHistoryExtensionsWindo DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)jminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme._.Marsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsC SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, AnvE SSHEl DutadogActionsCD WikiSecurity and qualityk Insights@ Settings|Fix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankor• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issues• COstI CHO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - PlatformC Ct: Depe• Actions s• DependelD TиsksInsights & Coachin.• Dev- UXQ Type ([] to searchQ Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)Daily - Platform - 6 m left100% [8• Tue 14 Apr 9:59:57ChatO Far DepeEA oootnrtStefka StoyanovaNikolay YankovSteliyan GeorgievNikolay IvanovAneliya Angelova14:32Lukas Kovalik...
|
NULL
|
|
8686
|
166
|
40
|
2026-04-14T06:59:54.474787+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149994474_m1.jpg...
|
Alfred
|
Alfred
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
Alfred Search Field
|
[{"role":"AXTextField","text [{"role":"AXTextField","text":"Alfred Search Field","depth":1,"bounds":{"left":0.26180556,"top":0.16777778,"width":0.4763889,"height":0.05888889},"help_text":"Alfred Search","role_description":"text field","is_enabled":true,"is_focused":true}]...
|
7926243118367575
|
7570943109877468232
|
visual_change
|
hybrid
|
NULL
|
Alfred Search Field
DiaFileEditViewTabsBookmarksHi Alfred Search Field
DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)C SRD-677C 3y 19790( Actions 11 Sentryjminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme.Marsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsE SSHActionsCD WikiSecurity and qualityk InsightsFix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankor• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesDr Foet AwrlEl Dutados O @ # ClaudeO Circieci@ Settings|autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - PlatformDaily - Platform - 6 m left100% [8• Tue 14 Apr 9:59:54ChatC C: Depe() Actions( DepenD TиsksInsights & Coachin.D Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent -Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)Va degDA oootnrtStefka StoyanovaNikolay YankovSteliyan GeorgievNikolay IvanovAneliya AngelovaLukas Kovalik14:29...
|
8683
|
|
8683
|
166
|
39
|
2026-04-14T06:59:52.651443+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149992651_m1.jpg...
|
Dia
|
Work: Meet - Daily - Pla…
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
7
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Zoom in
Open in new window
Enter Full Screen
Stefka Stoyanova
Nikolay Yankov
Steliyan Georgiev
Nikolay Ivanov
Aneliya Angelova
Lukas Kovalik
Others might see more of your background. Click to view your full video.
9:59
AM
Daily - Platform
Daily - Platform
Audio settings
Turn on microphone
Video settings
Turn off camera
Nikolay Yankov is presenting
Send a reaction
Turn on captions
Raise hand (ctrl + ⌘ + h)
More options
Leave call
Meeting details
Chat with everyone
Meeting tools
Aneliya Angelova joined...
|
[{"role":"AXHeading","text" [{"role":"AXHeading","text":"Nikolay Yankov (Presenting)","depth":13,"bounds":{"left":0.045833334,"top":0.11,"width":0.125,"height":0.022222223},"role_description":"heading"},{"role":"AXStaticText","text":"Nikolay Yankov (Presenting)","depth":14,"bounds":{"left":0.045833334,"top":0.11111111,"width":0.125,"height":0.02},"role_description":"text"},{"role":"AXPopUpButton","text":"People","depth":15,"bounds":{"left":0.8826389,"top":0.09888889,"width":0.04097222,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"7","depth":18,"bounds":{"left":0.91041666,"top":0.11,"width":0.0048611113,"height":0.016666668},"role_description":"text"},{"role":"AXPopUpButton","text":"Take notes with Gemini","depth":15,"bounds":{"left":0.9291667,"top":0.09888889,"width":0.025,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Take notes with Gemini","depth":19,"bounds":{"left":0.93194443,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXStaticText","text":"Gemini","depth":19,"bounds":{"left":0.9625,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXButton","text":"Gemini","depth":18,"bounds":{"left":0.9604167,"top":0.1,"width":0.023611112,"height":0.037777778},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Zoom in","depth":14,"bounds":{"left":0.62083334,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Open in new window","depth":15,"bounds":{"left":0.65416664,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Enter Full Screen","depth":15,"bounds":{"left":0.6875,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Stefka Stoyanova","depth":17,"bounds":{"left":0.7430556,"top":0.36666667,"width":0.08888889,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Nikolay Yankov","depth":17,"bounds":{"left":0.87222224,"top":0.36666667,"width":0.077083334,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Steliyan Georgiev","depth":17,"bounds":{"left":0.7430556,"top":0.6188889,"width":0.09097222,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Nikolay Ivanov","depth":17,"bounds":{"left":0.87222224,"top":0.6188889,"width":0.075,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Aneliya Angelova","depth":17,"bounds":{"left":0.7430556,"top":0.8711111,"width":0.08888889,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Lukas Kovalik","depth":17,"bounds":{"left":0.87222224,"top":0.8711111,"width":0.06944445,"height":0.022222223},"role_description":"text"},{"role":"AXButton","text":"Others might see more of your background. Click to view your full video.","depth":14,"bounds":{"left":0.9618056,"top":0.86777776,"width":0.018055556,"height":0.028888889},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"9:59","depth":10,"bounds":{"left":0.020833334,"top":0.93777776,"width":0.022222223,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"AM","depth":10,"bounds":{"left":0.045833334,"top":0.93777776,"width":0.018055556,"height":0.022222223},"role_description":"text"},{"role":"AXHeading","text":"Daily - Platform","depth":10,"bounds":{"left":0.08125,"top":0.90444446,"width":0.08194444,"height":0.08888889},"role_description":"heading"},{"role":"AXStaticText","text":"Daily - Platform","depth":13,"bounds":{"left":0.08125,"top":0.93777776,"width":0.08194444,"height":0.022222223},"role_description":"text"},{"role":"AXPopUpButton","text":"Audio settings","depth":14,"bounds":{"left":0.30416667,"top":0.9222222,"width":0.06111111,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn on microphone","depth":14,"bounds":{"left":0.33194444,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":true,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"Video settings","depth":14,"bounds":{"left":0.37083334,"top":0.9222222,"width":0.06111111,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn off camera","depth":14,"bounds":{"left":0.3986111,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Nikolay Yankov is presenting","depth":12,"bounds":{"left":0.4375,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXCheckBox","text":"Send a reaction","depth":12,"bounds":{"left":0.48194444,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"toggle button","subrole":"AXToggleButton","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn on captions","depth":11,"bounds":{"left":0.5263889,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXCheckBox","text":"Raise hand (ctrl + ⌘ + h)","depth":11,"bounds":{"left":0.5708333,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"toggle button","subrole":"AXToggleButton","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options","depth":14,"bounds":{"left":0.61527777,"top":0.9222222,"width":0.025,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Leave call","depth":11,"bounds":{"left":0.6458333,"top":0.9222222,"width":0.05,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Meeting details","depth":12,"bounds":{"left":0.8875,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Chat with everyone","depth":12,"bounds":{"left":0.92083335,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Meeting tools","depth":12,"bounds":{"left":0.95416665,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Aneliya Angelova joined","depth":4,"bounds":{"left":0.004166667,"top":0.08111111,"width":0.00069444446,"height":0.0011111111},"role_description":"text"}]...
|
6456261976390749238
|
5823429591821252920
|
app_switch
|
accessibility
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
7
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Zoom in
Open in new window
Enter Full Screen
Stefka Stoyanova
Nikolay Yankov
Steliyan Georgiev
Nikolay Ivanov
Aneliya Angelova
Lukas Kovalik
Others might see more of your background. Click to view your full video.
9:59
AM
Daily - Platform
Daily - Platform
Audio settings
Turn on microphone
Video settings
Turn off camera
Nikolay Yankov is presenting
Send a reaction
Turn on captions
Raise hand (ctrl + ⌘ + h)
More options
Leave call
Meeting details
Chat with everyone
Meeting tools
Aneliya Angelova joined...
|
NULL
|
|
8682
|
166
|
38
|
2026-04-14T06:59:45.398064+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149985398_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)jminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme...Marsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsC SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, AnvE SSHEl DutadogActionsCD WikiSecurity and qualityk Insights@ Settings|Fix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankor• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issues• COstI CHO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m left8•C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)ue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova14:208 • Tue 14 Apr 9:59:45Chat•B•Nikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
3534206569133913824
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)jminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme...Marsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsC SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, AnvE SSHEl DutadogActionsCD WikiSecurity and qualityk Insights@ Settings|Fix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankor• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issues• COstI CHO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m left8•C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)ue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova14:208 • Tue 14 Apr 9:59:45Chat•B•Nikolay YankovNikolay IvanovLukas Kovalik...
|
8681
|
|
8681
|
166
|
37
|
2026-04-14T06:59:42.355506+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149982355_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)nboк (21SRD-677T7 (SRO-67C зy 19796( Service-cJz Font Awe• COstI CHE3 SSHEl Dutados3 Circiecjminay / app&‹> Code11 Pull requests 24ActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows$1NNotion[include) Database Migrations[include) Start/Stop testing environme...Refactor Slack notification handling in secfix scripts and workflowPu Dpenosootiwee# 2681d34 pushed by nikolay-yankovMandoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pilkolay-yankovFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by nikolay-yankoy• Fix Dependabot issuesQ Actions :C Ct: DepeD Tиsks27 insights & CoachinD DevQ Type to searchQ Fiter workdiow runs•ux.autonate-dependabot-issue-auto300-0epc00000530Cautonate-dependabot-issue..autonate-dependabot-issue..E8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|68 Aor 9, 9:48 PM GMT+3(0) 150E8 Apr 9, 7:30 PM GMT+3Ô 4m 238Apr 9,7116 PM GMTe9:59 AM Daily - PlatformDaily - Platform • 6 m leftO Far DepeStefka StoyanovaSteliyan GeorgievAneliya Angelova14:178 • Tue 14 Apr 9:59:42• ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
-8529167732228103597
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)nboк (21SRD-677T7 (SRO-67C зy 19796( Service-cJz Font Awe• COstI CHE3 SSHEl Dutados3 Circiecjminay / app&‹> Code11 Pull requests 24ActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows$1NNotion[include) Database Migrations[include) Start/Stop testing environme...Refactor Slack notification handling in secfix scripts and workflowPu Dpenosootiwee# 2681d34 pushed by nikolay-yankovMandoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pilkolay-yankovFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by nikolay-yankoy• Fix Dependabot issuesQ Actions :C Ct: DepeD Tиsks27 insights & CoachinD DevQ Type to searchQ Fiter workdiow runs•ux.autonate-dependabot-issue-auto300-0epc00000530Cautonate-dependabot-issue..autonate-dependabot-issue..E8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|68 Aor 9, 9:48 PM GMT+3(0) 150E8 Apr 9, 7:30 PM GMT+3Ô 4m 238Apr 9,7116 PM GMTe9:59 AM Daily - PlatformDaily - Platform • 6 m leftO Far DepeStefka StoyanovaSteliyan GeorgievAneliya Angelova14:178 • Tue 14 Apr 9:59:42• ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
|
8680
|
166
|
36
|
2026-04-14T06:59:39.355417+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149979355_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)иbок (2)C SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, Anv• COstI CHE SSHEl Dutadogjminay / app&‹> Code11 Pull requestsActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows33 workflow runsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRelease EnvironmentStart-Stop EnvironmentRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankov[include) Start/Stop testing environme…Refactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankorMarsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m left8•C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)ue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova14:148 • Tue 14 Apr 9:59:39ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
-231202419043424571
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)иbок (2)C SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, Anv• COstI CHE SSHEl Dutadogjminay / app&‹> Code11 Pull requestsActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows33 workflow runsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRelease EnvironmentStart-Stop EnvironmentRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankov[include) Start/Stop testing environme…Refactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankorMarsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m left8•C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35s88 Aor 8, 10:16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)ue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova14:148 • Tue 14 Apr 9:59:39ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
8679
|
|
8679
|
166
|
35
|
2026-04-14T06:59:36.351777+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149976351_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)иbок (2)C SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, Anv• COstI CHE SSHEl Dutadogjminay / app&‹> Code11 Pull requestsActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows33 workflow runsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRelease EnvironmentStart-Stop EnvironmentRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankov[include) Start/Stop testing environme…Refactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankorMarsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m left8•C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35sE Apr 9, 10-16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)ue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova14:118 • Tue 14 Apr 9:59:36ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
4147430899678859583
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)иbок (2)C SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, Anv• COstI CHE SSHEl Dutadogjminay / app&‹> Code11 Pull requestsActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows33 workflow runsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorRelease EnvironmentStart-Stop EnvironmentRefactor secfix scripts for improved parsing and handling of covered…it $1171as pushed by nikolay-yankov[include) Start/Stop testing environme…Refactor Slack notification handling in secfix scripts and workflowPu upenosootweewea 2691834 pushed by sikolay-yankorMarsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m left8•C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev•ux.Q Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -Hn wOrKow88 Apr 9, 10-20 PM GMT+3Ô Sm 35sE Apr 9, 10-16 PM GMT+3Ô FailureE8 Aor 9, 10:12 PM OMT+3Ô FalureE Aor 9, 10:04 PM GMT+3© 50sỆ Apr 9, 9:48 PM GMT+3Ô em 128|Eệ Aor 9, 9-48 PM GMT+3(0) 150B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)ue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova14:118 • Tue 14 Apr 9:59:36ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
|
8676
|
166
|
34
|
2026-04-14T06:59:26.261860+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149966261_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)иbок (2)C SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, Anv• COstI CHE SSHEl Dutadogjminay / app&‹> Code11 Pull requestsActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows33 workflow runsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankovRelease EnvironmentStart-Stop EnvironmentRefactor secfix scripts for improved parsing and handling of covered…it 51171aa pushed by nikolay-yankkov[include) Start/Stop testing environme.Refactor Slack notification handling in secfix scripts and workflowPu upenosootweewe00OSe PWROOFTRONTaraosMarsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m leftue 14 Apr 9:59C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev- UXQ Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -H wOrmowStefka StoyanovaBranchesФрY-COSe-20-08-0980deploy-master-25-02-2026-myackovdeploy-master-24-02-2026automate-hubspot-app-deploymentФигтк-феpоnо8008-195005-1001E Aor 9, 9:48 PM GMT43Ô 158B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)Steliyan GeorgievAneliya Angelova14:018 • Tue 14 Apr 9:59:26Chat•j•Nikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
7059213043168540124
|
NULL
|
click
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)иbок (2)C SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, Anv• COstI CHE SSHEl Dutadogjminay / app&‹> Code11 Pull requestsActionsCD WikiSecurity and qualityk Insights@ Settings|ActionsFix Dependabot issuessecfix.ymlAll workfiows33 workflow runsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankovRelease EnvironmentStart-Stop EnvironmentRefactor secfix scripts for improved parsing and handling of covered…it 51171aa pushed by nikolay-yankkov[include) Start/Stop testing environme.Refactor Slack notification handling in secfix scripts and workflowPu upenosootweewe00OSe PWROOFTRONTaraosMarsoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metrics• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issuesO Actions :autonate-dependabot-issue_autonate-dependabot-issue.0U10110 -003000000183300autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - Platform> 0.Daily - Platform • 6 m leftue 14 Apr 9:59C Ct: Depe• Actions s• DependelO Far DepeD TиsksInsights & Coachin.• Dev- UXQ Type ([] to search+ •Q Fiter workdiow runsEvent-Status -BranchActor -H wOrmowStefka StoyanovaBranchesФрY-COSe-20-08-0980deploy-master-25-02-2026-myackovdeploy-master-24-02-2026automate-hubspot-app-deploymentФигтк-феpоnо8008-195005-1001E Aor 9, 9:48 PM GMT43Ô 158B Apr 9, 7:30 PM GMT+3|Ae 9, 718 PM CMTA)Steliyan GeorgievAneliya Angelova14:018 • Tue 14 Apr 9:59:26Chat•j•Nikolay YankovNikolay IvanovLukas Kovalik...
|
8675
|
|
8675
|
166
|
33
|
2026-04-14T06:59:21.245798+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149961245_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)jminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme...Mandoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsC SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, AnvE SSHEl DutadogActionsCD WikiSecurity and qualityk Insights@ Settings|Fix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorO Refactor secfix scripts for improved parsing and handling of coveredl...it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu wpenosootweewe00OSe PWROOFTRONTaraos• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issues• COstI CHO Actions :autonate-dependabot-issue_autonate-dependabot-issue.autonate-dependabot-issue_autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - PlatformC Ct: Depe• Actions s• DependelD TиsksInsights & Coachin.• Dev- UXQ Type ([] to searchQ Fiter workdiow runsEvent-Status -BranchActor -H wOrmowSelect branch |Tagshutspot-webhooks-bckpfa-planet-migrationsfa-fontamesome-download-issuesfeature(JY-19120-chorus-retryfeature(JY-16492-office-add-inSeatureT-10400© 4m 236Daily - Platform • 6 m leftO Far Depeue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova13:5688 • Tue 14 Apr 9:59:21ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
-8928779894947133075
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)jminay / app&‹> Code11 Pull requestsActionsAll workfiowsAPP DeploymentClaude Security Helper|Database MigrationsDependabot UpdatesRelease EnvironmentStart-Stop Environment[include) Start/Stop testing environme...Mandoecien8 CachesA Deployments• Attestations|EB Runners@ Usage metrics@ Performance metricsC SRD-677T7 (SRO-67С зy 19798Service-oDe Foet, AnvE SSHEl DutadogActionsCD WikiSecurity and qualityk Insights@ Settings|Fix Dependabot issuessecfix.yml33 workflow runsThis workflow has a workflow_dispatch event trigger.• Fix Dependabot issuesFix Dependabot issues 842: Manually run by pikolay-yankorO Refactor secfix scripts for improved parsing and handling of coveredl...it $1171as pushed by nikolay-yankovRefactor Slack notification handling in secfix scripts and workflowPu wpenosootweewe00OSe PWROOFTRONTaraos• Fix Dependabot issuesmu vepchos0otn3n con Monanh by magen• Fix Dependabot issuesFix Dependabot istues R38: Manually run by pikolay-yankoyFix Dependabot issuesFix Dependabot issues #37: Manually run by nikolay-yankor• Fix Dependabot issuesFix Dependabot issues #36: Manually run by pikolay-yankov• Fix Dependabot issues• COstI CHO Actions :autonate-dependabot-issue_autonate-dependabot-issue.autonate-dependabot-issue_autonate-dependabot-issue.autonate-dependabot-issue_9:59 AM Daily - PlatformC Ct: Depe• Actions s• DependelD TиsksInsights & Coachin.• Dev- UXQ Type ([] to searchQ Fiter workdiow runsEvent-Status -BranchActor -H wOrmowSelect branch |Tagshutspot-webhooks-bckpfa-planet-migrationsfa-fontamesome-download-issuesfeature(JY-19120-chorus-retryfeature(JY-16492-office-add-inSeatureT-10400© 4m 236Daily - Platform • 6 m leftO Far Depeue 14 Apr 9:59EA ooownrtStefka StoyanovaSteliyan GeorgievAneliya Angelova13:5688 • Tue 14 Apr 9:59:21ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
|
8672
|
166
|
32
|
2026-04-14T06:59:17.604857+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149957604_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)> 0.Daily - Platform • 6 m leftPlatform TeamREADY FOR DEV 1ooaiesnoTxwitexceed Fontawesomepackage limitsMAINTENANEDReady for Dev8 -20564 1T*ae+=S)))SRD-677С зy 19790Service-oie Foet, Ane• COstI CHhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/372selectedissue=JY-205510 ProjectsE SSHEl Dutadog* ClaudeO Circiec800$ Jy-12253 / E JY-20551Experiment with AI workflow for automated Dependabot fixes~ Descriptionproposed changes.Try to build a workflow that automatically detects open Dependabot issues, fixes them with Al agent and opens a PR withSubtasksLinked work itemsAdd linked work itemActivityCommentsHistoryWork lognou s commuasieSuggest a reply….Status update...Thanks…Pro tip: press M to commentNikolay Yankov9 Aprl 2026 at 14:17https://github.com/jminny/app/pull/11875Q Actions :C Ct: Depe• Actions :• Dependel0 Tasks|7 Insights & Coachin.• Dev•ux.In Dev vIx Improve Technical StoryDetailsAssigneeNikolay YankovReporterNikolay YankovQuick start developmentLink this work item to your codeoy meboine kor wnen croounea branch, commit, or pull requestUuneoomeriQ Open with VS Code1J Create branch4 Create commitCoiootenPlatformSub-Product8•• Far DepeAl BookmarksGroup: Queriesfix cowiniome ouring tofeature flag reieaseSUPPORT TICKETSDeployedXE -20565 15 •***=AI Review - Q1GROWTH - MAINTAIN OUR COMPE...Deployed_10082|1 • ..** =No such column "Type' onentity 'Event"DeployedE-20384 05 # *** =®Missed deletion of staleClosed02000 1 *=[Pipedrive) DI > status notlabeled, date data missingД -20014 05 01 **** =|Stefka StoyanovaSteliyan Georgiev9:59 AM Daily - Platform:Aneliya Angelova13:538 • Tue 14 Apr 9:59:17ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
NULL
|
2699606893527805357
|
NULL
|
click
|
ocr
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)> 0.Daily - Platform • 6 m leftPlatform TeamREADY FOR DEV 1ooaiesnoTxwitexceed Fontawesomepackage limitsMAINTENANEDReady for Dev8 -20564 1T*ae+=S)))SRD-677С зy 19790Service-oie Foet, Ane• COstI CHhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/372selectedissue=JY-205510 ProjectsE SSHEl Dutadog* ClaudeO Circiec800$ Jy-12253 / E JY-20551Experiment with AI workflow for automated Dependabot fixes~ Descriptionproposed changes.Try to build a workflow that automatically detects open Dependabot issues, fixes them with Al agent and opens a PR withSubtasksLinked work itemsAdd linked work itemActivityCommentsHistoryWork lognou s commuasieSuggest a reply….Status update...Thanks…Pro tip: press M to commentNikolay Yankov9 Aprl 2026 at 14:17https://github.com/jminny/app/pull/11875Q Actions :C Ct: Depe• Actions :• Dependel0 Tasks|7 Insights & Coachin.• Dev•ux.In Dev vIx Improve Technical StoryDetailsAssigneeNikolay YankovReporterNikolay YankovQuick start developmentLink this work item to your codeoy meboine kor wnen croounea branch, commit, or pull requestUuneoomeriQ Open with VS Code1J Create branch4 Create commitCoiootenPlatformSub-Product8•• Far DepeAl BookmarksGroup: Queriesfix cowiniome ouring tofeature flag reieaseSUPPORT TICKETSDeployedXE -20565 15 •***=AI Review - Q1GROWTH - MAINTAIN OUR COMPE...Deployed_10082|1 • ..** =No such column "Type' onentity 'Event"DeployedE-20384 05 # *** =®Missed deletion of staleClosed02000 1 *=[Pipedrive) DI > status notlabeled, date data missingД -20014 05 01 **** =|Stefka StoyanovaSteliyan Georgiev9:59 AM Daily - Platform:Aneliya Angelova13:538 • Tue 14 Apr 9:59:17ChatNikolay YankovNikolay IvanovLukas Kovalik...
|
8669
|
|
8669
|
166
|
31
|
2026-04-14T06:59:15.715581+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149955715_m1.jpg...
|
Firefox
|
Your Ask Jiminny test report report is ready - luk Your Ask Jiminny test report report is ready - lukas.kovalik@jiminny.com - Jiminny Mail — Work...
|
1
|
mail.google.com/mail/u/0/#inbox/FMfcgzQgLPHqhfTqvc mail.google.com/mail/u/0/#inbox/FMfcgzQgLPHqhfTqvcRjxxxrPDNRGZTl...
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Your Ask Jiminny test report report is ready - lukas.kovalik@jiminny.com - Jiminny Mail","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Your Ask Jiminny test report report is ready - lukas.kovalik@jiminny.com - Jiminny Mail","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"bounds":{"left":0.008333334,"top":0.0,"width":0.11840278,"height":0.013333334},"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"bounds":{"left":0.008333334,"top":0.0,"width":0.053819444,"height":0.011666667}}]...
|
-706451971627840756
|
4985326685068824967
|
click
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow...
|
NULL
|
|
8667
|
166
|
30
|
2026-04-14T06:59:06.025467+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149946025_m1.jpg...
|
Firefox
|
Your Ask Jiminny test report report is ready - luk Your Ask Jiminny test report report is ready - lukas.kovalik@jiminny.com - Jiminny Mail — Work...
|
1
|
mail.google.com/mail/u/0/#inbox/FMfcgzQgLPHqhfTqvc mail.google.com/mail/u/0/#inbox/FMfcgzQgLPHqhfTqvcRjxxxrPDNRGZTl...
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to
0
, the
if min_size
short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.
Potential Path Traversal Risk (Lines 110, 113, 251):
The script writes and reads files directly using...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Your Ask Jiminny test report report is ready - lukas.kovalik@jiminny.com - Jiminny Mail","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Your Ask Jiminny test report report is ready - lukas.kovalik@jiminny.com - Jiminny Mail","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Parsing:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script uses","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"create_parser","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to handle CLI arguments like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--data-set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--populate-ground-truth-llm","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--compare-to-ground-truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Initialization:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sets up directory paths for inputs and ground truths based on the selected dataset name.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Run:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Executes the evaluation via","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.run(run_evaluation(...))","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", which orchestrates the collection, endpoint calling, and result scoring phases.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Edit","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said are there soem issues with this PR","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"are there soem issues with this PR","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Show thinking","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Here is a breakdown of the most notable issues, categorized by impact:","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Performance & Async Anti-Patterns","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Performance & Async Anti-Patterns","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sequential Requests Defeating Async (Lines 86-106):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the code loops through","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"raw_samples","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"await","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"s a","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"session.post","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for asynchronous concurrency.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Collect these coroutines into a list and use","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.gather()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(ideally wrapped in an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.Semaphore","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to avoid overwhelming the endpoint) to run the replayability probes concurrently.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Standard synchronous file operations (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"with open(...)","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.dump","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.load","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") are used inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"async def","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"functions. In Python, this blocks the entire event loop while the disk is reading/writing.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Use an asynchronous file I/O library like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiofiles","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", or run the blocking I/O operations in a separate thread using","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.to_thread()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0013888889,"top":0.08111111,"width":0.0027777778,"height":0.023333333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Architecture & State Management","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Architecture & State Management","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Mutating Class-Level Variables (Lines 323-328):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script parses CLI arguments and directly mutates the class attributes (e.g.,","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator.DATA_SET = args.data_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Pass these arguments into the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"__init__","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"constructor of the instance (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"self.data_set = args.data_set","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") rather than modifying the class definition globally.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Code Quality & Edge Cases","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Code Quality & Edge Cases","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Broad Exception Catching (Line 107):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Using a bare","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except Exception:","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"during the replayability probe is dangerous. It will catch","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp.ClientError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(which is intended), but it will also silently swallow","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"TimeoutError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or even memory errors, making debugging incredibly difficult if the endpoint goes down.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Scope the exception to specific expected errors, like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except (aiohttp.ClientError, asyncio.TimeoutError):","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0,"top":1.0,"width":0.0027777778,"height":-0.04111111},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inefficient Size Calculation (Line 95):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"len(json.dumps(request_body)) < min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"requires serializing the entire dictionary to a string","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"just","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to check its length. Since","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"defaults to","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"if min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Potential Path Traversal Risk (Lines 110, 113, 251):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script writes and reads files directly using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"}]...
|
-2892713408067612165
|
6426478565828440471
|
visual_change
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Your Ask Jiminny test report report is ready - [EMAIL] - Jiminny Mail
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to
0
, the
if min_size
short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.
Potential Path Traversal Risk (Lines 110, 113, 251):
The script writes and reads files directly using...
|
8666
|
|
8666
|
166
|
29
|
2026-04-14T06:59:02.997685+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149942997_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Daily - Platform • 6 m left8•)))nbox (2)SRD-677C Jy 19796(7 Service-o xie Font Ame• COMTICHhttps://fjiminny.atlassian.net/jira/software/c/projects/JY/boards/37?selectedissue=JY-20551Simiri0 ProjectsB AWSE3 SSHElDatadog * Claude3 CircieciPlatform TeamQ Search board800$ JY-12253 / E JY-20551READY FOR DEV 1Experiment with AI workflow for automated Dependabot fixesexceed Fontawesomepackage limitsMAINTENANESReady for Dev8-205641 П ****:~ Descriptionproposed chandesTry to bulld a workflow that automatically detects open Dependabot issues, fixes them with Al agent and opens a PR with |SubtasksLinked work itemsActivityCommentsHistoryWork logNUU & COITETISTnSuggest a reply…..Status update...Thanks….Pro tip: press M to commertcohny Yanko9 April 2026 at 14.17• Actions :С Ct: Depe• Actions sDependolD TasksInsights & Coachin.• Del0ux.©2|In Dev v#x Improve Technical Story~ DetailsAssignee& Nikolay YanikowKeoohNikolay YankovQuick start developmentLink this work item to your codeby including keys when creatinga branch, commit, or pull requestbelow. Learn moreDismissDevelopment@ Open with VS Code1 Create branch+ Create commitComponentsPlatformSub-Product• Far DepeD Al Bookmarks9:59 AM Daily - PlatformGroup: QueriesPuooanonoeeleature fiag releaseDeployed0 -20565 15 + ****=AI Review - Q1CTNTHEVATAORCEVDeployed__19982 |1 ••**=No sucn coumh "lype onentty 'Event"PLATTORM STABTLTTYDeployed8 -20384 05 1 ****=Missed deletion of stale0m400R3 409000(Pipedrive) DI > status notlabeled, date data missingREDUCT CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievLovalik13:388 • Tue 14 Apr 9:59:02ChatNikolay YankovNikolay Ivanov...
|
NULL
|
-8582562440983614743
|
NULL
|
visual_change
|
ocr
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Daily - Platform • 6 m left8•)))nbox (2)SRD-677C Jy 19796(7 Service-o xie Font Ame• COMTICHhttps://fjiminny.atlassian.net/jira/software/c/projects/JY/boards/37?selectedissue=JY-20551Simiri0 ProjectsB AWSE3 SSHElDatadog * Claude3 CircieciPlatform TeamQ Search board800$ JY-12253 / E JY-20551READY FOR DEV 1Experiment with AI workflow for automated Dependabot fixesexceed Fontawesomepackage limitsMAINTENANESReady for Dev8-205641 П ****:~ Descriptionproposed chandesTry to bulld a workflow that automatically detects open Dependabot issues, fixes them with Al agent and opens a PR with |SubtasksLinked work itemsActivityCommentsHistoryWork logNUU & COITETISTnSuggest a reply…..Status update...Thanks….Pro tip: press M to commertcohny Yanko9 April 2026 at 14.17• Actions :С Ct: Depe• Actions sDependolD TasksInsights & Coachin.• Del0ux.©2|In Dev v#x Improve Technical Story~ DetailsAssignee& Nikolay YanikowKeoohNikolay YankovQuick start developmentLink this work item to your codeby including keys when creatinga branch, commit, or pull requestbelow. Learn moreDismissDevelopment@ Open with VS Code1 Create branch+ Create commitComponentsPlatformSub-Product• Far DepeD Al Bookmarks9:59 AM Daily - PlatformGroup: QueriesPuooanonoeeleature fiag releaseDeployed0 -20565 15 + ****=AI Review - Q1CTNTHEVATAORCEVDeployed__19982 |1 ••**=No sucn coumh "lype onentty 'Event"PLATTORM STABTLTTYDeployed8 -20384 05 1 ****=Missed deletion of stale0m400R3 409000(Pipedrive) DI > status notlabeled, date data missingREDUCT CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievLovalik13:388 • Tue 14 Apr 9:59:02ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8665
|
166
|
28
|
2026-04-14T06:58:59.975976+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149939975_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelp> 0.meet.google.com / Daily - PlatformNikolay Yankov (Presenting)SRD-677XC Jy 19796(7 Service-ords/37?selectedissue=JY-20551El Datadogie Font. AmrCostI ChActionsС Ct: DepeD Tasks• Actions s27 Insights & CoachielSminnyPlatform TeamQ Search boardREADY FOR DEV 1hoeenotxwirtexceed Fontawesomepockage limitsMASNTENANESReady for Dev027-20564 1Projects806• Dev9:58 AM ||Daily - PlatformDaily - Platform - 7 m left8•Dependal0ux.• Far DepeGroup: QueriesDEPLOY &feature fiag releaseSUPPORT TICKETSDeployed0 -205e5 15 + •***=AI Review - 01DeployeQ -19982 1••= 9No sucn column Type onentty 'Event"PLATTORM STARILITYDeployed2038405 •.00Missed deletion of staleREDUGE CHURN0 -4001a-ene=(Pipedrive] DI > status notlabeled, date data missingREDUSE CHURNDeployeo0 -200м 03 l1 •***=Stefka StoyanovaSteliyan Georgievlik13:358 • Tue 14 Apr 9:58:59ChatNikolay YankovNikolay Ivanov...
|
NULL
|
1268666469939478919
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelp> 0.meet.google.com / Daily - PlatformNikolay Yankov (Presenting)SRD-677XC Jy 19796(7 Service-ords/37?selectedissue=JY-20551El Datadogie Font. AmrCostI ChActionsС Ct: DepeD Tasks• Actions s27 Insights & CoachielSminnyPlatform TeamQ Search boardREADY FOR DEV 1hoeenotxwirtexceed Fontawesomepockage limitsMASNTENANESReady for Dev027-20564 1Projects806• Dev9:58 AM ||Daily - PlatformDaily - Platform - 7 m left8•Dependal0ux.• Far DepeGroup: QueriesDEPLOY &feature fiag releaseSUPPORT TICKETSDeployed0 -205e5 15 + •***=AI Review - 01DeployeQ -19982 1••= 9No sucn column Type onentty 'Event"PLATTORM STARILITYDeployed2038405 •.00Missed deletion of staleREDUGE CHURN0 -4001a-ene=(Pipedrive] DI > status notlabeled, date data missingREDUSE CHURNDeployeo0 -200м 03 l1 •***=Stefka StoyanovaSteliyan Georgievlik13:358 • Tue 14 Apr 9:58:59ChatNikolay YankovNikolay Ivanov...
|
8664
|
|
8664
|
166
|
27
|
2026-04-14T06:58:53.958034+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149933958_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677(7 (5RD-67)C Jy 1979607 Service-oir Font. Amr• COMTICHhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-20553• Sminn0 Projects# AWS|E SSHEl Dutadog* Claude3 CircieciPlatform TeamQ Search board800$ JY-20285 / A JY-20553READY FOR DEV 1Linked work itemsAdd linked work itemooesnoTixWiwexceed Fontawesomepackage limitsMAINTENANCISReady for Dev8-20564 1~ ActivitywotanoiStatus update...Pro tip: press (M to commNikolay Nikolov &Nhwewoalo4onConsider this has helped for the drop from 19 to 6 H: () https://github.com/jminny/app/pull/11796Connect.youcolay ikoloy9 Aprl 2026 at 16:28oonioruOeyE (Draft) Debug crm-sync long jobsPossible worker outage starting at 22:00 UTC on 2026-03-24:• Actions :A, Sentryl9:58 AM | Daily - Platform(ah|Daily - Platform • 7 m left8•С Ct: Depe• Actions sDependol• Far DepeD Tasks7 Insights & Coachin• DevDux.D Al BookmarksIn Dev vXZ Improve Story~ DetailsNikolay NikolovAsuntoeReporter8 Stetka Stoyanovauuick star dcvciopmentLink this work item to your codeoy inciuding xeys wiith Creaungla branch, commit, or pull request |below. Learn moreDismissDevelopment• Oocn with VS CodelCreate branch4 Create commitComponentsPlatformSub-ProductAdd optionsGroup: QueriesDEPLOY &"uoehnonoeleature fiag releaseDeployed0 -20585 15 + ****=AI Review - Q1CTNTHEWANTANOURCOVIDeployed19982 1 @ ...•=No such column "Type' onentty 'Event"PLATTORM STABTLITYDeployed9 -20384 05 ^ ****=|Missed deletion of stale(Pipedrive) DI > status notlabeled, date data missingCREDUCE CHURNNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik13:298 • Tue 14 Apr 9:58:53ChatNikolay YankovNikolay Ivanov...
|
NULL
|
-326921371687853529
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677(7 (5RD-67)C Jy 1979607 Service-oir Font. Amr• COMTICHhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-20553• Sminn0 Projects# AWS|E SSHEl Dutadog* Claude3 CircieciPlatform TeamQ Search board800$ JY-20285 / A JY-20553READY FOR DEV 1Linked work itemsAdd linked work itemooesnoTixWiwexceed Fontawesomepackage limitsMAINTENANCISReady for Dev8-20564 1~ ActivitywotanoiStatus update...Pro tip: press (M to commNikolay Nikolov &Nhwewoalo4onConsider this has helped for the drop from 19 to 6 H: () https://github.com/jminny/app/pull/11796Connect.youcolay ikoloy9 Aprl 2026 at 16:28oonioruOeyE (Draft) Debug crm-sync long jobsPossible worker outage starting at 22:00 UTC on 2026-03-24:• Actions :A, Sentryl9:58 AM | Daily - Platform(ah|Daily - Platform • 7 m left8•С Ct: Depe• Actions sDependol• Far DepeD Tasks7 Insights & Coachin• DevDux.D Al BookmarksIn Dev vXZ Improve Story~ DetailsNikolay NikolovAsuntoeReporter8 Stetka Stoyanovauuick star dcvciopmentLink this work item to your codeoy inciuding xeys wiith Creaungla branch, commit, or pull request |below. Learn moreDismissDevelopment• Oocn with VS CodelCreate branch4 Create commitComponentsPlatformSub-ProductAdd optionsGroup: QueriesDEPLOY &"uoehnonoeleature fiag releaseDeployed0 -20585 15 + ****=AI Review - Q1CTNTHEWANTANOURCOVIDeployed19982 1 @ ...•=No such column "Type' onentty 'Event"PLATTORM STABTLITYDeployed9 -20384 05 ^ ****=|Missed deletion of stale(Pipedrive) DI > status notlabeled, date data missingCREDUCE CHURNNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik13:298 • Tue 14 Apr 9:58:53ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8663
|
166
|
26
|
2026-04-14T06:58:47.915378+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149927915_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)SRD-677C Jy 1979607 Service-oir Font. Amr• COMTICHhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-20553• Smir0 Projects# AWS|E SSHE Datados* Claude3 CircieciPlatform Team800$ JY-20285 / A JY-20553READY FOR DEV 1Delays in CRM SyncoohesnetxWiwexceed Fontawesomepackage limitsMAINTENANCISReady for Dev8-20564 1 П ****=Descriptioncrm.sync queue consistently contains messages with 18+ hours age, meanwhile the number of messages is not huge.therefore more worker won't help. Please make sure that the messages in this queue are processed timely - for not morethan 1-2 minutes.Approximate Age 01 Oldest Message* is an AWS SQS metric that measures:The time elopsed since the oldest messoge wos added to the queue and is stil woiting to be received (picked up byo worker)Metric ValueMeaning19 hoursA message (ob) was dispatched 19 hours ago and has notyet been picked up by any workerSubtasksAdd subtaskLinked work items• Actions :A, Sentryl9:58 AM | Daily - Platform(ah|Daily - Platform • 7 m left8•С Ct: Depe• Actions sDependol• Far DepeD Tasks7 Insights & Coachin.• DevDux.D Al BookmarksIn Dev vXZ Improve Story~ DetailsNikolay NikolovAsuntoeReporter8 Stetka StoyanovaQuick start developmentLink this work item to your codeby including keys when creatinga branch, commit, or pull request |below. Learn moreDismissDevelopmentQ Open with VS Code!J Create branch4 Create commitComponentsPlatformSub-ProductAdd optionsGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 + ****=AI Review - Q1CTWIHENANANORCOUHDeployed_19082|1 • •***=wo sucn coumh "lype onentty 'Event"PLATTORM STABTLITYDeployed9 -20384 05 11 ****=|Missed deletion of stale74000(Pipedrive) DI > status notlabeled, date data missingDeployed0 -20014 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik13:238 • Tue 14 Apr 9:58:47Chat6Nikolay YankovNikolay Ivanov...
|
NULL
|
-7273881887843919756
|
NULL
|
visual_change
|
ocr
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)SRD-677C Jy 1979607 Service-oir Font. Amr• COMTICHhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-20553• Smir0 Projects# AWS|E SSHE Datados* Claude3 CircieciPlatform Team800$ JY-20285 / A JY-20553READY FOR DEV 1Delays in CRM SyncoohesnetxWiwexceed Fontawesomepackage limitsMAINTENANCISReady for Dev8-20564 1 П ****=Descriptioncrm.sync queue consistently contains messages with 18+ hours age, meanwhile the number of messages is not huge.therefore more worker won't help. Please make sure that the messages in this queue are processed timely - for not morethan 1-2 minutes.Approximate Age 01 Oldest Message* is an AWS SQS metric that measures:The time elopsed since the oldest messoge wos added to the queue and is stil woiting to be received (picked up byo worker)Metric ValueMeaning19 hoursA message (ob) was dispatched 19 hours ago and has notyet been picked up by any workerSubtasksAdd subtaskLinked work items• Actions :A, Sentryl9:58 AM | Daily - Platform(ah|Daily - Platform • 7 m left8•С Ct: Depe• Actions sDependol• Far DepeD Tasks7 Insights & Coachin.• DevDux.D Al BookmarksIn Dev vXZ Improve Story~ DetailsNikolay NikolovAsuntoeReporter8 Stetka StoyanovaQuick start developmentLink this work item to your codeby including keys when creatinga branch, commit, or pull request |below. Learn moreDismissDevelopmentQ Open with VS Code!J Create branch4 Create commitComponentsPlatformSub-ProductAdd optionsGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 + ****=AI Review - Q1CTWIHENANANORCOUHDeployed_19082|1 • •***=wo sucn coumh "lype onentty 'Event"PLATTORM STABTLITYDeployed9 -20384 05 11 ****=|Missed deletion of stale74000(Pipedrive) DI > status notlabeled, date data missingDeployed0 -20014 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik13:238 • Tue 14 Apr 9:58:47Chat6Nikolay YankovNikolay Ivanov...
|
8660
|
|
8660
|
166
|
25
|
2026-04-14T06:58:44.297672+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149924297_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)SRD-677C Jy 1979607 Service-oir Font. Amr• COMTICH[URL_WITH_CREDENTIALS] Dev v# Improve Story~ DetailsAssignee #Nikolay NikolovAsunteteReporter8 Stetka StoyanovaQuick start developmentLink this work item to your codeby including keys when creatinga branch, commit, or pull request |below. Learn moreDismissDevelopmentQ Open with VS CodeCreate branch4 Create commitComponentsPlatformSub-ProductAdd optionsGroup: QueriesDEPLOY 8leature fiag releaseDeployed0 -20565 15 * ****=AI Review - Q1CTWIHENANANORCOUHDeployed19982 1 @ ..••=wo sucn coumh "lype onentty 'Event"PLATTORM STABTLITYDeployed9 -20384 05 11 ****=Missed deletion of stale(Pipedrive) DI > status notlabeled, date data missingCREDUCE CHURNNDeployed0 -20014 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik13:198 • Tue 14 Apr 9:58:43Chat6Nikolay YankovNikolay Ivanov...
|
NULL
|
-8162506983783773073
|
NULL
|
click
|
ocr
|
NULL
|
DialFilelEditViewTabsBookmarksHistoryExtensionsWin DialFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)SRD-677C Jy 1979607 Service-oir Font. Amr• COMTICH[URL_WITH_CREDENTIALS] Dev v# Improve Story~ DetailsAssignee #Nikolay NikolovAsunteteReporter8 Stetka StoyanovaQuick start developmentLink this work item to your codeby including keys when creatinga branch, commit, or pull request |below. Learn moreDismissDevelopmentQ Open with VS CodeCreate branch4 Create commitComponentsPlatformSub-ProductAdd optionsGroup: QueriesDEPLOY 8leature fiag releaseDeployed0 -20565 15 * ****=AI Review - Q1CTWIHENANANORCOUHDeployed19982 1 @ ..••=wo sucn coumh "lype onentty 'Event"PLATTORM STABTLITYDeployed9 -20384 05 11 ****=Missed deletion of stale(Pipedrive) DI > status notlabeled, date data missingCREDUCE CHURNNDeployed0 -20014 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik13:198 • Tue 14 Apr 9:58:43Chat6Nikolay YankovNikolay Ivanov...
|
NULL
|
|
8657
|
166
|
24
|
2026-04-14T06:58:40.145065+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149920145_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelp$0.(ah|Daily - Platform - 7 m leftmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Q 8•• Fa Depe xInbox (2)PlactormSRD-677https://jiminny.atlassian.net/jira/software/c/projects/JY/boards/37Sminn8) ProjectiE SSHPlatform TeamQ Search board2000018Турe чLabel vQuick filtersREADY FOR DEV 1INDEV 4COOE REVIEW 3Part? Automated reoorts)exceed FontawesomeMASNTENANESReady for Dev8-20564 1AJREPORTSIn DevAIBEFE4-16900 115n****= 3Delays in CRM SyncCUATFORM STABTUITYIn DevQJy-20553workfiow for automatedDependabot fixesEMPROVEMENT OF OUR EFFICEINCYIn Dev© J-205512+000sru and staoiize MobieOS using new mobile use...MAINTENANCEIn DevQAД JY-204313****=0Evaluation foe AI ActivityTypesAUTO-DETECTED ACTIVITY TYPECode Review93-19798 15 + •*•=AJ Reports > Tracking LCode Review8JY-20543 1 .0=for SSO for 'persistent'name_id_formatREDUCE CHURNCode ReviewД -37-2063205 •.•=+ CreateC Jy 19796(7 Service-o xfont AwtXO COMICH• Actions :С Ct: DepeQ Actions :DependolEl Dacadog © * Claude5 Circiecl[ Sentry8) TasksInsights & Coachin.• DevDux.Complete sprint8) AI Bookmarks)BLOCKEDPO ACCEPTANCEGroup: QueriesDEPLOY &fix cownome ouring teleature fiag releaseBUPPORT TICKETSDeployed0 -20585 15 ****=AI Review - Q1OeployedP -19982|1 • •=*=no sucn coumn "lype onentity 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 П •***=Missed deletion of staleCEOUCE CHURD0 -20670 /80 01 ...*=(Pipedrive) DI > status not|labeled, date data missingREOUCE CHURNDeployed0 -20614 05 П ****=Stefka StoyanovalSteliyan GeorgievKovalik9:58 AM | Daily - Platform13:158 • Tue 14 Apr 9:58:39ChatNikolay YankovNikolay Ivanov...
|
NULL
|
-3139250388967787009
|
NULL
|
click
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelp$0.(ah|Daily - Platform - 7 m leftmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Q 8•• Fa Depe xInbox (2)PlactormSRD-677https://jiminny.atlassian.net/jira/software/c/projects/JY/boards/37Sminn8) ProjectiE SSHPlatform TeamQ Search board2000018Турe чLabel vQuick filtersREADY FOR DEV 1INDEV 4COOE REVIEW 3Part? Automated reoorts)exceed FontawesomeMASNTENANESReady for Dev8-20564 1AJREPORTSIn DevAIBEFE4-16900 115n****= 3Delays in CRM SyncCUATFORM STABTUITYIn DevQJy-20553workfiow for automatedDependabot fixesEMPROVEMENT OF OUR EFFICEINCYIn Dev© J-205512+000sru and staoiize MobieOS using new mobile use...MAINTENANCEIn DevQAД JY-204313****=0Evaluation foe AI ActivityTypesAUTO-DETECTED ACTIVITY TYPECode Review93-19798 15 + •*•=AJ Reports > Tracking LCode Review8JY-20543 1 .0=for SSO for 'persistent'name_id_formatREDUCE CHURNCode ReviewД -37-2063205 •.•=+ CreateC Jy 19796(7 Service-o xfont AwtXO COMICH• Actions :С Ct: DepeQ Actions :DependolEl Dacadog © * Claude5 Circiecl[ Sentry8) TasksInsights & Coachin.• DevDux.Complete sprint8) AI Bookmarks)BLOCKEDPO ACCEPTANCEGroup: QueriesDEPLOY &fix cownome ouring teleature fiag releaseBUPPORT TICKETSDeployed0 -20585 15 ****=AI Review - Q1OeployedP -19982|1 • •=*=no sucn coumn "lype onentity 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 П •***=Missed deletion of staleCEOUCE CHURD0 -20670 /80 01 ...*=(Pipedrive) DI > status not|labeled, date data missingREOUCE CHURNDeployed0 -20614 05 П ****=Stefka StoyanovalSteliyan GeorgievKovalik9:58 AM | Daily - Platform13:158 • Tue 14 Apr 9:58:39ChatNikolay YankovNikolay Ivanov...
|
8656
|
|
8656
|
166
|
23
|
2026-04-14T06:58:38.883462+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149918883_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelp$0.(ah|Daily - Platform - 7 m leftmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Q 8•• Fa Depe xInbox (2)PlactormSRD-677C7 (5RD-67C Jy 19796(7 Service-o xfont AwtX• COMTICH• Actions :С Ct: DepeQ Actions :Dependolhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/37Sminn8) ProjectiE SSHEl Dacadog © * Claude5 CircieclA Sentry|8) TasksInsights & Coachin.• DevDux.Platform TeamQ Search board2000018Турe чLabel vQuick filtersComplete sprintREADY FOR DEV 1INDEV 4COOE REVIEW 3BLOCKEDPart? Automated reoorts)exceed FontawesomeMASNTENANESReady for Dev8-20564 11*as*AJREPORTSIn DevAIBEFE4-16900 115n****= 3Delays in CRM Sync|CUATFORM STABTUITYIn DevQJy-20553toeweworkfiow for automatedDependabot fixesEMPROVEMENT OF OUR EFFICEINCYIn Dev₴л-205512+000sru and staoiize MobieOS using new mobile use...MAINTENANCEIn DevQAД JY-204313****=0PO ACCEPTANCEEvaluation foe AI ActivityTypesAUTO-DETECTED ACTIVITY TYPECode Review93-19798 15 + •••=+ CreateAJ Reports TrackingCode ReviewД л-20543 1 *** = ©for SSO for 'persistent'name_id_formatREDUCE CHURNCode ReviewД 31-206320,5 ..0=8) AI Bookmarks)Group: QueriesDEPLOY &Hix oownome ouring taleature fiag releaseBUPPORT TICKETSDeployed0 -20585 15 ****=AI Review - Q1OeployedP -19982|1 • •=*=no sucn coumn "lype onentity 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 П •***=Missed deletion of staleCEOUCE CHURD0 -20670 /80 01 ...*=(Pipedrive) DI > status not|labeled, date data missingREOUCE CHURNDeployed0 -20614 05 П ****=Stefka StoyanovaSteliyan GeorgievKovalik9:58 AM | Daily - Platform13:148 • Tue 14 Apr 9:58:38ChatNikolay YankovNikolay Ivanov...
|
NULL
|
7573639158322813260
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelp$0.(ah|Daily - Platform - 7 m leftmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Q 8•• Fa Depe xInbox (2)PlactormSRD-677C7 (5RD-67C Jy 19796(7 Service-o xfont AwtX• COMTICH• Actions :С Ct: DepeQ Actions :Dependolhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/37Sminn8) ProjectiE SSHEl Dacadog © * Claude5 CircieclA Sentry|8) TasksInsights & Coachin.• DevDux.Platform TeamQ Search board2000018Турe чLabel vQuick filtersComplete sprintREADY FOR DEV 1INDEV 4COOE REVIEW 3BLOCKEDPart? Automated reoorts)exceed FontawesomeMASNTENANESReady for Dev8-20564 11*as*AJREPORTSIn DevAIBEFE4-16900 115n****= 3Delays in CRM Sync|CUATFORM STABTUITYIn DevQJy-20553toeweworkfiow for automatedDependabot fixesEMPROVEMENT OF OUR EFFICEINCYIn Dev₴л-205512+000sru and staoiize MobieOS using new mobile use...MAINTENANCEIn DevQAД JY-204313****=0PO ACCEPTANCEEvaluation foe AI ActivityTypesAUTO-DETECTED ACTIVITY TYPECode Review93-19798 15 + •••=+ CreateAJ Reports TrackingCode ReviewД л-20543 1 *** = ©for SSO for 'persistent'name_id_formatREDUCE CHURNCode ReviewД 31-206320,5 ..0=8) AI Bookmarks)Group: QueriesDEPLOY &Hix oownome ouring taleature fiag releaseBUPPORT TICKETSDeployed0 -20585 15 ****=AI Review - Q1OeployedP -19982|1 • •=*=no sucn coumn "lype onentity 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 П •***=Missed deletion of staleCEOUCE CHURD0 -20670 /80 01 ...*=(Pipedrive) DI > status not|labeled, date data missingREOUCE CHURNDeployed0 -20614 05 П ****=Stefka StoyanovaSteliyan GeorgievKovalik9:58 AM | Daily - Platform13:148 • Tue 14 Apr 9:58:38ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8652
|
166
|
22
|
2026-04-14T06:58:29.817167+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149909817_m1.jpg...
|
Firefox
|
Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminn Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail — Work...
|
1
|
mail.google.com/mail/u/0/#inbox
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Parsing:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script uses","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"create_parser","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to handle CLI arguments like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--data-set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--populate-ground-truth-llm","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--compare-to-ground-truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Initialization:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sets up directory paths for inputs and ground truths based on the selected dataset name.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Run:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Executes the evaluation via","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.run(run_evaluation(...))","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", which orchestrates the collection, endpoint calling, and result scoring phases.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Edit","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said are there soem issues with this PR","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"are there soem issues with this PR","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Show thinking","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Here is a breakdown of the most notable issues, categorized by impact:","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Performance & Async Anti-Patterns","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Performance & Async Anti-Patterns","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sequential Requests Defeating Async (Lines 86-106):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the code loops through","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"raw_samples","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"await","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"s a","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"session.post","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for asynchronous concurrency.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Collect these coroutines into a list and use","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.gather()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(ideally wrapped in an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.Semaphore","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to avoid overwhelming the endpoint) to run the replayability probes concurrently.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Standard synchronous file operations (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"}]...
|
7018639353123991036
|
6714708941975892375
|
visual_change
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (...
|
8649
|
|
8649
|
166
|
21
|
2026-04-14T06:58:25.770777+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149905770_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)SInbox (2)SRD-677XC Jy 19798(7 Service-oi Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=Jy-18909• Siminm0 ProjectsEI AWSE3 SSHEl DatadogPlatform TeamQ Search board800$ Jy-19240 / [ JY-18900READY FOR DEV 1Subtasks|ooeeheTxWirtexceed Fontawesomepackage limitsMAINTENANCISReady for Dev©-20564 1 П ****=PrioritySto.....8 JY-20574 [AI) Ensure PDF formatting is good= Medium% JY-20582 (BE+AI+Infra] Create new queue= Medium% JY-20586 [QA] Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cioned byA JY-20458 CRUD reports with Ask. JiminnyJY-20849 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an..O Jy-20060 non-creator cannos ocieue generated ask Jiminny reportwrom A.JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n.….JY-20655 Generated Ask Jiminny Report email is sent to recipientsJY-20854 Generated Ask Jiminny Report is visible only to creator and shar...• JY-20650 Generated Ask Jiminny Report PDF includes branding, report se.•**-77% DoneAiwneeSteliyan...it Steliyan ...IN DEVIN DEV SAnellya A...READY FOR DEVS8 Unassign...KAUTPOKOYDEPLOYEDDONEM.DONEDONE MDONEDONE MDONEActions9:58 AM | Daily - Platform> 0.Daily - Platform • 7 m left8•CE: DepeDependol• Far DepeD Tasks7 Insights & Coachin• Del0ux.© 2XZ Improve Story~ DetailsAssignee32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaDevelopmentO Open with VS Code3 branches32 commits1 pull request1 build failingRelessesProduction7 See all deploymentsComponentsPlatformSub-ProductAooton5 days agcUrEhGroup: QueriesDEPLOY &fix connume ouring tcleature fiag reieaseDeployed0E -20585 15 + ****=AI Review - Q1CTNIHEMANTANORCODeployed_19082|1 • •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 11 ****=Missed deletion of stale0 -400034.9000(Pipedrive) DI > status notlabeled, date data missingREDUCE CHURNNDeployed0 -20614 05 П ****=:D Al BookmarksStefka StoyanovaSteliyan GeorgievKovalik13:018 • Tue 14 Apr 9:58:25ChatNikolay YankovNikolay Ivanov...
|
NULL
|
-8869008819253757706
|
NULL
|
click
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)SInbox (2)SRD-677XC Jy 19798(7 Service-oi Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=Jy-18909• Siminm0 ProjectsEI AWSE3 SSHEl DatadogPlatform TeamQ Search board800$ Jy-19240 / [ JY-18900READY FOR DEV 1Subtasks|ooeeheTxWirtexceed Fontawesomepackage limitsMAINTENANCISReady for Dev©-20564 1 П ****=PrioritySto.....8 JY-20574 [AI) Ensure PDF formatting is good= Medium% JY-20582 (BE+AI+Infra] Create new queue= Medium% JY-20586 [QA] Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cioned byA JY-20458 CRUD reports with Ask. JiminnyJY-20849 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an..O Jy-20060 non-creator cannos ocieue generated ask Jiminny reportwrom A.JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n.….JY-20655 Generated Ask Jiminny Report email is sent to recipientsJY-20854 Generated Ask Jiminny Report is visible only to creator and shar...• JY-20650 Generated Ask Jiminny Report PDF includes branding, report se.•**-77% DoneAiwneeSteliyan...it Steliyan ...IN DEVIN DEV SAnellya A...READY FOR DEVS8 Unassign...KAUTPOKOYDEPLOYEDDONEM.DONEDONE MDONEDONE MDONEActions9:58 AM | Daily - Platform> 0.Daily - Platform • 7 m left8•CE: DepeDependol• Far DepeD Tasks7 Insights & Coachin• Del0ux.© 2XZ Improve Story~ DetailsAssignee32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaDevelopmentO Open with VS Code3 branches32 commits1 pull request1 build failingRelessesProduction7 See all deploymentsComponentsPlatformSub-ProductAooton5 days agcUrEhGroup: QueriesDEPLOY &fix connume ouring tcleature fiag reieaseDeployed0E -20585 15 + ****=AI Review - Q1CTNIHEMANTANORCODeployed_19082|1 • •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 11 ****=Missed deletion of stale0 -400034.9000(Pipedrive) DI > status notlabeled, date data missingREDUCE CHURNNDeployed0 -20614 05 П ****=:D Al BookmarksStefka StoyanovaSteliyan GeorgievKovalik13:018 • Tue 14 Apr 9:58:25ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8647
|
166
|
20
|
2026-04-14T06:57:59.610437+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149879610_m1.jpg...
|
Firefox
|
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf — Work...
|
1
|
app.staging.jiminny.com/ai-reports/pdf/e2f4890a-d0 app.staging.jiminny.com/ai-reports/pdf/e2f4890a-d097-495b-8dfd-d4c25a5e8843...
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to
0
, the
if min_size
short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.
Potential Path Traversal Risk (Lines 110, 113, 251):
The script writes and reads files directly using
f"{es_id}.json"
and
f"gt_{req_id}.json"
. If
es_id
or
req_id
originate from a database and ever contain characters like
../
, it could result in files being read from or written to unintended directories.
Fix:
Sanitize the IDs or strictly enforce that they only contain alphanumeric characters.
Would you like me to rewrite the
_collect_from_es
method to show how to fix the sequential async loop and blocking file I/O?
Enter a prompt for Gemini
encrypted
Enter a prompt for Gemini
encrypted
Open upload file menu
Tools
Open mode picker...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Parsing:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script uses","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"create_parser","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to handle CLI arguments like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--data-set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--populate-ground-truth-llm","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--compare-to-ground-truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Initialization:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sets up directory paths for inputs and ground truths based on the selected dataset name.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Run:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Executes the evaluation via","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.run(run_evaluation(...))","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", which orchestrates the collection, endpoint calling, and result scoring phases.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Edit","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said are there soem issues with this PR","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"are there soem issues with this PR","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Show thinking","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Here is a breakdown of the most notable issues, categorized by impact:","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Performance & Async Anti-Patterns","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Performance & Async Anti-Patterns","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sequential Requests Defeating Async (Lines 86-106):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the code loops through","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"raw_samples","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"await","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"s a","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"session.post","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for asynchronous concurrency.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Collect these coroutines into a list and use","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.gather()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(ideally wrapped in an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.Semaphore","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to avoid overwhelming the endpoint) to run the replayability probes concurrently.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Standard synchronous file operations (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"with open(...)","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.dump","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.load","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") are used inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"async def","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"functions. In Python, this blocks the entire event loop while the disk is reading/writing.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Use an asynchronous file I/O library like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiofiles","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", or run the blocking I/O operations in a separate thread using","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.to_thread()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0013888889,"top":0.08111111,"width":0.0027777778,"height":0.023333333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Architecture & State Management","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Architecture & State Management","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Mutating Class-Level Variables (Lines 323-328):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script parses CLI arguments and directly mutates the class attributes (e.g.,","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator.DATA_SET = args.data_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Pass these arguments into the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"__init__","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"constructor of the instance (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"self.data_set = args.data_set","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") rather than modifying the class definition globally.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Code Quality & Edge Cases","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Code Quality & Edge Cases","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Broad Exception Catching (Line 107):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Using a bare","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except Exception:","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"during the replayability probe is dangerous. It will catch","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp.ClientError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(which is intended), but it will also silently swallow","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"TimeoutError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or even memory errors, making debugging incredibly difficult if the endpoint goes down.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Scope the exception to specific expected errors, like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except (aiohttp.ClientError, asyncio.TimeoutError):","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0,"top":1.0,"width":0.0027777778,"height":-0.04111111},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inefficient Size Calculation (Line 95):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"len(json.dumps(request_body)) < min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"requires serializing the entire dictionary to a string","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"just","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to check its length. Since","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"defaults to","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"if min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Potential Path Traversal Risk (Lines 110, 113, 251):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script writes and reads files directly using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"f\"{es_id}.json\"","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"f\"gt_{req_id}.json\"","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":". If","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es_id","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"req_id","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"originate from a database and ever contain characters like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"../","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", it could result in files being read from or written to unintended directories.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sanitize the IDs or strictly enforce that they only contain alphanumeric characters.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Would you like me to rewrite the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"method to show how to fix the sequential async loop and blocking file I/O?","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXTextArea","text":"Enter a prompt for Gemini\nencrypted","depth":20,"value":"Enter a prompt for Gemini\nencrypted","help_text":"","role_description":"text entry area","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Enter a prompt for Gemini","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"encrypted","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Open upload file menu","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Tools","depth":18,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Open mode picker","depth":20,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false}]...
|
-8457584461451125948
|
6426478565828374791
|
visual_change
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to
0
, the
if min_size
short-circuit saves it for now, but if anyone sets a minimum size, this will become an unnecessary CPU bottleneck on large datasets.
Potential Path Traversal Risk (Lines 110, 113, 251):
The script writes and reads files directly using
f"{es_id}.json"
and
f"gt_{req_id}.json"
. If
es_id
or
req_id
originate from a database and ever contain characters like
../
, it could result in files being read from or written to unintended directories.
Fix:
Sanitize the IDs or strictly enforce that they only contain alphanumeric characters.
Would you like me to rewrite the
_collect_from_es
method to show how to fix the sequential async loop and blocking file I/O?
Enter a prompt for Gemini
encrypted
Enter a prompt for Gemini
encrypted
Open upload file menu
Tools
Open mode picker...
|
8644
|
|
8644
|
166
|
19
|
2026-04-14T06:57:47.519689+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149867519_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)SInbox (2)SRD-677XC Jy 19798(7 Service-oi Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Sminm0 Projects# AWSE3 SSHEl DutadogPlatform TeamQ Search board800$ Jy-19240 / A JY-18900READY FOR DEV 1ooeeheTxWirtexceed Fontawesomepackage limitsMAINTENANCISReady for Dev©-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium|% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. JiminnyIs tested byJY-20649 Generate Ask Jiminny Report and display it in AI Reports• JY-20656 Expired Ask Jiminny report template is automatically disabled an…..© JY-20660 Non-creator cannot delete generated Ask Jiminny report from A...JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible only to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..•**_AssigneeStatus3 Steliyan ...i2 Stellyan -INDEYVINDEV vAnellya A...& Unassign...READY FOR DEVREADY FOR DEVSDEPLOYEDVDONEDONEDONEDONEDONEMDONEWUnEActions9:57 AM Daily - Platform> 0.Daily - Platform • 8 m left8•ue 14 Apr 9:57CE: DepeDependol• Far DepeD Tasks7 Insights & Coachin• Del0ux.D Al Bookmarks© 2XZ Improve Story~ DetailsAssignee32 Stellyan GeorgievAsuntoneReporter2 Galya DimitrovaUerClo eiene@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction7 See all deploymentsComponentsPlatformSub-ProductAooton5 days agcUrEhGroup: QueriesDEPLOY &fix connome ouring teleature fiag releaseDeployed0E -20585 15 + ****=AI Review - Q1CTNIHEMANTANORCODeployed19982 1No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 11 ****=Missed deletion of stale0 m40048009(Pipedrive) DI > status notlabeled, date data missingREDUCE CHURNNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik12:238 • Tue 14 Apr 9:57:47ChatNikolay YankovNikolay Ivanov...
|
NULL
|
3602653603668989296
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)SInbox (2)SRD-677XC Jy 19798(7 Service-oi Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Sminm0 Projects# AWSE3 SSHEl DutadogPlatform TeamQ Search board800$ Jy-19240 / A JY-18900READY FOR DEV 1ooeeheTxWirtexceed Fontawesomepackage limitsMAINTENANCISReady for Dev©-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium|% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. JiminnyIs tested byJY-20649 Generate Ask Jiminny Report and display it in AI Reports• JY-20656 Expired Ask Jiminny report template is automatically disabled an…..© JY-20660 Non-creator cannot delete generated Ask Jiminny report from A...JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible only to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..•**_AssigneeStatus3 Steliyan ...i2 Stellyan -INDEYVINDEV vAnellya A...& Unassign...READY FOR DEVREADY FOR DEVSDEPLOYEDVDONEDONEDONEDONEDONEMDONEWUnEActions9:57 AM Daily - Platform> 0.Daily - Platform • 8 m left8•ue 14 Apr 9:57CE: DepeDependol• Far DepeD Tasks7 Insights & Coachin• Del0ux.D Al Bookmarks© 2XZ Improve Story~ DetailsAssignee32 Stellyan GeorgievAsuntoneReporter2 Galya DimitrovaUerClo eiene@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction7 See all deploymentsComponentsPlatformSub-ProductAooton5 days agcUrEhGroup: QueriesDEPLOY &fix connome ouring teleature fiag releaseDeployed0E -20585 15 + ****=AI Review - Q1CTNIHEMANTANORCODeployed19982 1No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed8 -20384 05 11 ****=Missed deletion of stale0 m40048009(Pipedrive) DI > status notlabeled, date data missingREDUCE CHURNNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik12:238 • Tue 14 Apr 9:57:47ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8642
|
166
|
18
|
2026-04-14T06:57:44.468996+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149864468_m1.jpg...
|
Firefox
|
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf — Work...
|
1
|
app.staging.jiminny.com/ai-reports/pdf/e2f4890a-d0 app.staging.jiminny.com/ai-reports/pdf/e2f4890a-d097-495b-8dfd-d4c25a5e8843...
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Parsing:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script uses","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"create_parser","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to handle CLI arguments like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--data-set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--populate-ground-truth-llm","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--compare-to-ground-truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Initialization:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sets up directory paths for inputs and ground truths based on the selected dataset name.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Run:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Executes the evaluation via","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.run(run_evaluation(...))","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", which orchestrates the collection, endpoint calling, and result scoring phases.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Edit","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said are there soem issues with this PR","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"are there soem issues with this PR","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Show thinking","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Here is a breakdown of the most notable issues, categorized by impact:","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Performance & Async Anti-Patterns","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Performance & Async Anti-Patterns","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sequential Requests Defeating Async (Lines 86-106):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the code loops through","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"raw_samples","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"await","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"s a","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"session.post","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for asynchronous concurrency.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Collect these coroutines into a list and use","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.gather()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(ideally wrapped in an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.Semaphore","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to avoid overwhelming the endpoint) to run the replayability probes concurrently.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Standard synchronous file operations (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"with open(...)","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.dump","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.load","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") are used inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"async def","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"functions. In Python, this blocks the entire event loop while the disk is reading/writing.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Use an asynchronous file I/O library like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiofiles","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", or run the blocking I/O operations in a separate thread using","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.to_thread()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0013888889,"top":0.08111111,"width":0.0027777778,"height":0.023333333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Architecture & State Management","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Architecture & State Management","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Mutating Class-Level Variables (Lines 323-328):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script parses CLI arguments and directly mutates the class attributes (e.g.,","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator.DATA_SET = args.data_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Pass these arguments into the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"__init__","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"constructor of the instance (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"self.data_set = args.data_set","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") rather than modifying the class definition globally.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Code Quality & Edge Cases","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Code Quality & Edge Cases","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Broad Exception Catching (Line 107):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Using a bare","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except Exception:","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"during the replayability probe is dangerous. It will catch","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp.ClientError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(which is intended), but it will also silently swallow","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"TimeoutError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or even memory errors, making debugging incredibly difficult if the endpoint goes down.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Scope the exception to specific expected errors, like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except (aiohttp.ClientError, asyncio.TimeoutError):","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0,"top":1.0,"width":0.0027777778,"height":-0.04111111},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inefficient Size Calculation (Line 95):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"len(json.dumps(request_body)) < min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"requires serializing the entire dictionary to a string","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"just","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to check its length. Since","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"defaults to","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"}]...
|
-2553411110187335245
|
6426478565827391767
|
visual_change
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to...
|
8640
|
|
8640
|
166
|
17
|
2026-04-14T06:57:38.421835+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149858421_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19796(7 Service-oie Font. AmrCostI ChActionshttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=Jy-18909• SminmProjects# AWSE SSHEl DatadogPlatform TeamQ Search board800$ Jy-19240 / A JY-18900READY FOR DEV 1oohenetxwirtexceed Fontawesomepackage limitsMAINTENANCSReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium|% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an..JY-20660 Non-creator cannot delete generated Ask Jiminny report from A.JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar..JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..* Claude3 Circieci•"*_AssigneeStatus32 Steliyan ...32 Stellyan .-.INOEYVINDEV vAnellya A...& Unassign...READY FOR DEVREADY FOR DEVSDEPLOYEDVDONEDONE MDONEDONE MDONEDONEWUnE9:57 AM Daily - Platform> 0.Daily - Platform • 8 m leftС Ct: DepeActionsDependolD Tasks2 Insights & Coachin• DevDux.© 2In Dev vXz Improve StoryDetailsAssignee3 Stellyan GeorgievReporter2 Galya DimitrovaUeyele enecne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcUrEhRelessesProductionSee all deploymentsComponentsPlatformSub-ProductAootcons8•• Far DepelGroup: Queriesfix connume ouring toleature fiag releaseDeployed0E -20585 15 * ****=AI Review - Q1CTNIHEWANANORCONDeployed-199821 @ ••=•=No such column "Type' onentty 'Event"PLATTORM STABILITYDeployed2038405 +600=Missed deletion of staleadsese(Pipedrive) DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=D Al BookmarksStefka Stoyanovaiyan GeorgievKovalik12:138 • Tue 14 Apr 9:57:37ChatNikolay YankovNikolay Ivanov...
|
NULL
|
-2193002351411712431
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19796(7 Service-oie Font. AmrCostI ChActionshttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=Jy-18909• SminmProjects# AWSE SSHEl DatadogPlatform TeamQ Search board800$ Jy-19240 / A JY-18900READY FOR DEV 1oohenetxwirtexceed Fontawesomepackage limitsMAINTENANCSReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium|% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an..JY-20660 Non-creator cannot delete generated Ask Jiminny report from A.JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar..JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..* Claude3 Circieci•"*_AssigneeStatus32 Steliyan ...32 Stellyan .-.INOEYVINDEV vAnellya A...& Unassign...READY FOR DEVREADY FOR DEVSDEPLOYEDVDONEDONE MDONEDONE MDONEDONEWUnE9:57 AM Daily - Platform> 0.Daily - Platform • 8 m leftС Ct: DepeActionsDependolD Tasks2 Insights & Coachin• DevDux.© 2In Dev vXz Improve StoryDetailsAssignee3 Stellyan GeorgievReporter2 Galya DimitrovaUeyele enecne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcUrEhRelessesProductionSee all deploymentsComponentsPlatformSub-ProductAootcons8•• Far DepelGroup: Queriesfix connume ouring toleature fiag releaseDeployed0E -20585 15 * ****=AI Review - Q1CTNIHEWANANORCONDeployed-199821 @ ••=•=No such column "Type' onentty 'Event"PLATTORM STABILITYDeployed2038405 +600=Missed deletion of staleadsese(Pipedrive) DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=D Al BookmarksStefka Stoyanovaiyan GeorgievKovalik12:138 • Tue 14 Apr 9:57:37ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8638
|
166
|
16
|
2026-04-14T06:57:37.300197+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149857300_m1.jpg...
|
Firefox
|
Jiminny — Work
|
1
|
app.staging.jiminny.com/ai-reports
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Close tab
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
New Tab...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Inbox (1,540) - lukas.kovalik@jiminny.com - Jiminny Mail","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false}]...
|
-5107577173204566284
|
1545677505700740271
|
click
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Close tab
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
Inbox (1,540) - [EMAIL] - Jiminny Mail
Inbox (1,540) - [EMAIL] - Jiminny Mail
New Tab...
|
NULL
|
|
8637
|
166
|
15
|
2026-04-14T06:57:35.371970+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149855371_m1.jpg...
|
Dia
|
Work: Meet - Daily - Pla…
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
6
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Unpin Nikolay Yankov's presentation from your main screen
You can't unmute someone else's presentation
More options for Nikolay Yankov
Zoom in
Open in new window
Enter Full Screen
Pin Stefka Stoyanova to your main screen
Mute Stefka Stoyanova's microphone
More options for Stefka Stoyanova
Stefka Stoyanova
Pin Nikolay Yankov to your main screen
Mute Nikolay Yankov's microphone
More options for Nikolay Yankov
Nikolay Yankov
Pin Steliyan Georgiev to your main screen
You can't unmute someone else
More options for Steliyan Georgiev
Steliyan Georgiev
Pin Nikolay Ivanov to your main screen
You can't unmute someone else
More options for Nikolay Ivanov
Nikolay Ivanov
You’re continuously framed
Backgrounds and effects
More options for Lukas Kovalik
Lukas Kovalik
Others might see more of your background. Click to view your full video.
9:57
AM
Daily - Platform
Daily - Platform
Audio settings
Turn on microphone
Video settings
Turn off camera
Nikolay Yankov is presenting
Send a reaction
Turn on captions
Raise hand (ctrl + ⌘ + h)
More options
Leave call
Meeting details
Chat with everyone
Meeting tools
Your microphone is off.
meet.google.com / Daily - Platform...
|
[{"role":"AXHeading","text" [{"role":"AXHeading","text":"Nikolay Yankov (Presenting)","depth":13,"bounds":{"left":0.045833334,"top":0.11,"width":0.125,"height":0.022222223},"role_description":"heading"},{"role":"AXStaticText","text":"Nikolay Yankov (Presenting)","depth":14,"bounds":{"left":0.045833334,"top":0.11111111,"width":0.125,"height":0.02},"role_description":"text"},{"role":"AXPopUpButton","text":"People","depth":15,"bounds":{"left":0.8826389,"top":0.09888889,"width":0.04097222,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"6","depth":18,"bounds":{"left":0.91041666,"top":0.11,"width":0.0048611113,"height":0.016666668},"role_description":"text"},{"role":"AXPopUpButton","text":"Take notes with Gemini","depth":15,"bounds":{"left":0.9291667,"top":0.09888889,"width":0.025,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Take notes with Gemini","depth":19,"bounds":{"left":0.93194443,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXStaticText","text":"Gemini","depth":19,"bounds":{"left":0.9625,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXButton","text":"Gemini","depth":18,"bounds":{"left":0.9604167,"top":0.1,"width":0.023611112,"height":0.037777778},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"Unpin Nikolay Yankov's presentation from your main screen","depth":15,"bounds":{"left":0.3263889,"top":0.51,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"You can't unmute someone else's presentation","depth":15,"bounds":{"left":0.35416666,"top":0.50777775,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":false,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options for Nikolay Yankov","depth":16,"bounds":{"left":0.38472223,"top":0.51,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Zoom in","depth":14,"bounds":{"left":0.6201389,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Open in new window","depth":15,"bounds":{"left":0.65347224,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Enter Full Screen","depth":15,"bounds":{"left":0.68680555,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"Pin Stefka Stoyanova to your main screen","depth":15,"bounds":{"left":0.75208336,"top":0.25777778,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Mute Stefka Stoyanova's microphone","depth":15,"bounds":{"left":0.7798611,"top":0.25555557,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options for Stefka Stoyanova","depth":16,"bounds":{"left":0.81041664,"top":0.25777778,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Stefka Stoyanova","depth":17,"bounds":{"left":0.7430556,"top":0.36666667,"width":0.08888889,"height":0.022222223},"role_description":"text"},{"role":"AXPopUpButton","text":"Pin Nikolay Yankov to your main screen","depth":15,"bounds":{"left":0.88125,"top":0.25777778,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Mute Nikolay Yankov's microphone","depth":15,"bounds":{"left":0.90902776,"top":0.25555557,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options for Nikolay Yankov","depth":16,"bounds":{"left":0.93958336,"top":0.25777778,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Nikolay Yankov","depth":17,"bounds":{"left":0.87222224,"top":0.36666667,"width":0.077083334,"height":0.022222223},"role_description":"text"},{"role":"AXPopUpButton","text":"Pin Steliyan Georgiev to your main screen","depth":15,"bounds":{"left":0.75208336,"top":0.51,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"You can't unmute someone else","depth":15,"bounds":{"left":0.7798611,"top":0.50777775,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":false,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options for Steliyan Georgiev","depth":16,"bounds":{"left":0.81041664,"top":0.51,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Steliyan Georgiev","depth":17,"bounds":{"left":0.7430556,"top":0.6188889,"width":0.09097222,"height":0.022222223},"role_description":"text"},{"role":"AXPopUpButton","text":"Pin Nikolay Ivanov to your main screen","depth":15,"bounds":{"left":0.88125,"top":0.51,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"You can't unmute someone else","depth":15,"bounds":{"left":0.90902776,"top":0.50777775,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":false,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options for Nikolay Ivanov","depth":16,"bounds":{"left":0.93958336,"top":0.51,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Nikolay Ivanov","depth":17,"bounds":{"left":0.87222224,"top":0.6188889,"width":0.075,"height":0.022222223},"role_description":"text"},{"role":"AXButton","text":"You’re continuously framed","depth":15,"bounds":{"left":0.81527776,"top":0.76,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":false,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Backgrounds and effects","depth":15,"bounds":{"left":0.84583336,"top":0.76,"width":0.030555556,"height":0.04888889},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options for Lukas Kovalik","depth":16,"bounds":{"left":0.8763889,"top":0.76222223,"width":0.027777778,"height":0.044444446},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Lukas Kovalik","depth":17,"bounds":{"left":0.7423611,"top":0.8711111,"width":0.06944445,"height":0.022222223},"role_description":"text"},{"role":"AXButton","text":"Others might see more of your background. Click to view your full video.","depth":14,"bounds":{"left":0.9618056,"top":0.86777776,"width":0.018055556,"height":0.028888889},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"9:57","depth":10,"bounds":{"left":0.020833334,"top":0.93777776,"width":0.022222223,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"AM","depth":10,"bounds":{"left":0.045833334,"top":0.93777776,"width":0.018055556,"height":0.022222223},"role_description":"text"},{"role":"AXHeading","text":"Daily - Platform","depth":10,"bounds":{"left":0.08125,"top":0.90444446,"width":0.08194444,"height":0.08888889},"role_description":"heading"},{"role":"AXStaticText","text":"Daily - Platform","depth":13,"bounds":{"left":0.08125,"top":0.93777776,"width":0.08194444,"height":0.022222223},"role_description":"text"},{"role":"AXPopUpButton","text":"Audio settings","depth":14,"bounds":{"left":0.30416667,"top":0.9222222,"width":0.06111111,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn on microphone","depth":14,"bounds":{"left":0.33194444,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":true,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"Video settings","depth":14,"bounds":{"left":0.37083334,"top":0.9222222,"width":0.06111111,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn off camera","depth":14,"bounds":{"left":0.3986111,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Nikolay Yankov is presenting","depth":12,"bounds":{"left":0.4375,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXCheckBox","text":"Send a reaction","depth":12,"bounds":{"left":0.48194444,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"toggle button","subrole":"AXToggleButton","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn on captions","depth":11,"bounds":{"left":0.5263889,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXCheckBox","text":"Raise hand (ctrl + ⌘ + h)","depth":11,"bounds":{"left":0.5708333,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"toggle button","subrole":"AXToggleButton","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options","depth":14,"bounds":{"left":0.61527777,"top":0.9222222,"width":0.025,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Leave call","depth":11,"bounds":{"left":0.6458333,"top":0.9222222,"width":0.05,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Meeting details","depth":12,"bounds":{"left":0.8875,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Chat with everyone","depth":12,"bounds":{"left":0.92083335,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Meeting tools","depth":12,"bounds":{"left":0.95416665,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Your microphone is off.","depth":4,"bounds":{"left":0.004166667,"top":0.08111111,"width":0.00069444446,"height":0.0011111111},"role_description":"text"},{"role":"AXTextArea","text":"meet.google.com / Daily - Platform","depth":5,"bounds":{"left":0.11111111,"top":0.04111111,"width":0.21597221,"height":0.033333335},"automation_id":"navigationBarAssistantBarTextField","value":"meet.google.com / Daily - Platform","role_description":"text entry area","is_focused":false}]...
|
-926312337826985780
|
-3399732331133805800
|
visual_change
|
hybrid
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
6
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Unpin Nikolay Yankov's presentation from your main screen
You can't unmute someone else's presentation
More options for Nikolay Yankov
Zoom in
Open in new window
Enter Full Screen
Pin Stefka Stoyanova to your main screen
Mute Stefka Stoyanova's microphone
More options for Stefka Stoyanova
Stefka Stoyanova
Pin Nikolay Yankov to your main screen
Mute Nikolay Yankov's microphone
More options for Nikolay Yankov
Nikolay Yankov
Pin Steliyan Georgiev to your main screen
You can't unmute someone else
More options for Steliyan Georgiev
Steliyan Georgiev
Pin Nikolay Ivanov to your main screen
You can't unmute someone else
More options for Nikolay Ivanov
Nikolay Ivanov
You’re continuously framed
Backgrounds and effects
More options for Lukas Kovalik
Lukas Kovalik
Others might see more of your background. Click to view your full video.
9:57
AM
Daily - Platform
Daily - Platform
Audio settings
Turn on microphone
Video settings
Turn off camera
Nikolay Yankov is presenting
Send a reaction
Turn on captions
Raise hand (ctrl + ⌘ + h)
More options
Leave call
Meeting details
Chat with everyone
Meeting tools
Your microphone is off.
meet.google.com / Daily - Platform
DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677X(7 (5RD-67)C Jy 19796(7 Service-oie Font. AmrCostI Chhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=Jy-18909• SminmProjects# AWSE SSHPlatform TeamQ Search board800$ Jy-19240 / A JY-18900READY FOR DEV 1exceed Fontawesomepackage limitsMAINTENANCSReady for Dev8-20564 1 П ****=% JY-20574 [AI) Ensure PDF formatting is good% JY-20586 (QA) Manual Testing% JY-20626 Integration testingLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byEl Datadog3 CircieciPriority= Medium= Medium32 SteliyanSteliyan...= Medii= Medi4h8 UnaDEPLOYEDVDONEJY-20656 Expired Ask Jiminny report template is automatically disabled an.DONE MJY-20660 Non-creator cannot delete generated Ask Jiminny report from A.DONEJY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...DONE MJY-20655 Generated Ask Jiminny Report emall is sent to recipientsDONE0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and sihar......DONEJY-20650 Generated Ask Jiminny Report PDF includes branding, report se..WUnEActionsStatusINOEYVINDEV vREADY FOR DEV <READY FOR DEVS9:57 AM Daily - PlatformlaalDaily - Platform - 8 m left100% <478• Tue 14 Apr 9:57:34ChatС Ct: DepeQ Actions sDependolD Tasks7 Insights & Coachin• DelDux.Xz Improve Story~ DetailsAssignee3 Stellyan GeorgievReporter2 Galya DimitrovaUeyele enecne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcUrEhRelessesProductionSee all deploymentsComponentsPlatformSub-ProductAootcons8•• Far DepelD Al BookmarksGroup: Querieshix connoie ouring toDeployed0 -20ses 15 4 •***=•AI Review - Q1CTNIHEWANANORCONDeployed-19982 1 @ •..•=No such column "Type' onentty 'Event"PLATTORM STABTLITYDeployed8 -20384 05 ^ ****=|Missed deletion of staleadsese(Pipedrive) DI > status notQStefka StoyanovaNikolay YankovSteliyan GeorgievNikolay IvanovKovalik:12:10...
|
8635
|
|
8635
|
166
|
14
|
2026-04-14T06:57:24.671631+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149844671_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFileEditViewTabsBookmarksHistoryExtensionsWindo DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)SRD-677XF7 (SRD-67C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsE3 SSHEl DatadogPlatform Team200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=Priority% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV <& Unassign...READY FOR DEVSDEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActions9:57 AM Daily - PlatformlaalDaily - Platform - 8 m left100% <478• Tue 14 Apr 9:57:246ChatCE: DeperDependolD TasksInsights & CoachinD DevD ux.© 2Xz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons8•• Far DepeGroup: QueriesDEPLOY &leature fiag reieaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:D Al BookmarksStefka StoyanovaNikolay Yankovan GeorgievNikolay Ivanovovalik12:00...
|
NULL
|
-1896545725656580030
|
NULL
|
click
|
ocr
|
NULL
|
DiaFileEditViewTabsBookmarksHistoryExtensionsWindo DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)SRD-677XF7 (SRD-67C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsE3 SSHEl DatadogPlatform Team200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=Priority% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV <& Unassign...READY FOR DEVSDEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActions9:57 AM Daily - PlatformlaalDaily - Platform - 8 m left100% <478• Tue 14 Apr 9:57:246ChatCE: DeperDependolD TasksInsights & CoachinD DevD ux.© 2Xz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons8•• Far DepeGroup: QueriesDEPLOY &leature fiag reieaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:D Al BookmarksStefka StoyanovaNikolay Yankovan GeorgievNikolay Ivanovovalik12:00...
|
NULL
|
|
8633
|
166
|
13
|
2026-04-14T06:57:23.760141+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149843760_m1.jpg...
|
Dia
|
Work: Meet - Daily - Pla…
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
6
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Zoom in
Open in new window
Enter Full Screen
Stefka Stoyanova...
|
[{"role":"AXHeading","text" [{"role":"AXHeading","text":"Nikolay Yankov (Presenting)","depth":13,"bounds":{"left":0.045833334,"top":0.11,"width":0.125,"height":0.022222223},"role_description":"heading"},{"role":"AXStaticText","text":"Nikolay Yankov (Presenting)","depth":14,"bounds":{"left":0.045833334,"top":0.11111111,"width":0.125,"height":0.02},"role_description":"text"},{"role":"AXPopUpButton","text":"People","depth":15,"bounds":{"left":0.8826389,"top":0.09888889,"width":0.04097222,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"6","depth":18,"bounds":{"left":0.91041666,"top":0.11,"width":0.0048611113,"height":0.016666668},"role_description":"text"},{"role":"AXPopUpButton","text":"Take notes with Gemini","depth":15,"bounds":{"left":0.9291667,"top":0.09888889,"width":0.025,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Take notes with Gemini","depth":19,"bounds":{"left":0.93194443,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXStaticText","text":"Gemini","depth":19,"bounds":{"left":0.9625,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXButton","text":"Gemini","depth":18,"bounds":{"left":0.9604167,"top":0.1,"width":0.023611112,"height":0.037777778},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Zoom in","depth":14,"bounds":{"left":0.6201389,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Open in new window","depth":15,"bounds":{"left":0.65347224,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Enter Full Screen","depth":15,"bounds":{"left":0.68680555,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Stefka Stoyanova","depth":17,"bounds":{"left":0.7430556,"top":0.36666667,"width":0.08888889,"height":0.022222223},"role_description":"text"}]...
|
6319403837425361321
|
8077623044035967784
|
click
|
accessibility
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
6
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Zoom in
Open in new window
Enter Full Screen
Stefka Stoyanova...
|
8627
|
|
8627
|
166
|
12
|
2026-04-14T06:56:59.495129+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149819495_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONEDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform•10> 0.Daily - Platform • 9 m left8•CE: DepeDependol• Far DepeD Tasks2 Insights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 П ****=Missed deletion of stale0 m400R 4.0000(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan Georgievalik11:358 • Tue 14 Apr 9:56:59ChatNikolay YankovNikolay Ivanov...
|
NULL
|
8266537810111475323
|
NULL
|
click
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONEDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform•10> 0.Daily - Platform • 9 m left8•CE: DepeDependol• Far DepeD Tasks2 Insights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 П ****=Missed deletion of stale0 m400R 4.0000(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan Georgievalik11:358 • Tue 14 Apr 9:56:59ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8624
|
166
|
11
|
2026-04-14T06:56:55.430786+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149815430_m1.jpg...
|
Firefox
|
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf — Work...
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"}]...
|
2367160298329268996
|
8732321579331784839
|
click
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Ask Jiminny test report - 8 Apr 2026 - Ask Jiminny test report - 13 Apr 2026.pdf
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM...
|
8622
|
|
8622
|
166
|
10
|
2026-04-14T06:56:40.315536+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149800315_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform> 0.Daily - Platform • 9 m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka Stoyanovateliyan Georgievalik11:168 • Tue 14 Apr 9:56:40ChatNikolay YankovNikolay Ivanov...
|
NULL
|
-589028587563285134
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform> 0.Daily - Platform • 9 m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka Stoyanovateliyan Georgievalik11:168 • Tue 14 Apr 9:56:40ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8621
|
166
|
9
|
2026-04-14T06:56:37.283165+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149797283_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform> 0.Daily - Platform • 9 m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka Stoyanovateliyan Georgievalik11:128 • Tue 14 Apr 9:56:36ChatNikolay YankovNikolay Ivanov...
|
NULL
|
97371676481529223
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform> 0.Daily - Platform • 9 m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka Stoyanovateliyan Georgievalik11:128 • Tue 14 Apr 9:56:36ChatNikolay YankovNikolay Ivanov...
|
8617
|
|
8617
|
166
|
8
|
2026-04-14T06:56:10.958149+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149770958_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONEDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform> 0.Daily - Platform - 9m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stalea080se(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan Georgievvalik10:468 • Tue 14 Apr 9:56:10Chat6Nikolay YankovNikolay Ivanov...
|
NULL
|
-1740625934814605516
|
NULL
|
click
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..AssigneeStatus3 Steliyan...i2 Stellyan .-.INDEYVINDEV vAnellya A...READY FOR DEV& Unassign...READY FOR DEVSDEPLOYEDDONEDONEDONEDONE MDONEDONEWUnEActions9:56 AM | Daily - Platform> 0.Daily - Platform - 9m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY &leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of stalea080se(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan Georgievvalik10:468 • Tue 14 Apr 9:56:10Chat6Nikolay YankovNikolay Ivanov...
|
NULL
|
|
8615
|
166
|
7
|
2026-04-14T06:56:10.121660+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149770121_m1.jpg...
|
Firefox
|
Jiminny — Work
|
1
|
app.staging.jiminny.com/ai-reports
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Close tab
New Tab
Customize sidebar...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false}]...
|
-4024085138441079458
|
6160734901706043051
|
click
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Close tab
New Tab
Customize sidebar...
|
8613
|
|
8613
|
166
|
6
|
2026-04-14T06:55:57.999851+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149757999_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFileEditViewTabsBookmarksHistoryExtensionsWindo DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)SRD-677XF7 (SRD-67C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsE3 SSHEl DatadogPlatform Team200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..Assignee3 Steliyan...i2 Stellyan .-.Anellya A...& Unassign...DEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActionsStatusINDEYVINDEV vREADY FOR DEV <READY FOR DEVSTurn on microphone (88 + d)9:55 AM | Daily - Platform(ablDaily - Platform - 10 m left100% [8• Tue 14 Apr 9:55:57ChatCE: DeperDependolD TasksInsights & Coachin• DelD ux.© 2Xz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons8•• Far DepeGroup: QueriesDEPLOY &leature fiag releaseDeployed12006515 * *00=AI Review - Q1CTNMEUANANORCORDeployed_19982 |1 0 •*=•=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 11 ****=Missed deletion of stale0 -400R3 409009(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaNikolay YankovSteliyan GeorgievNikolay lvanov#Lukas Kovalik10:33...
|
NULL
|
-7529556405988642857
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFileEditViewTabsBookmarksHistoryExtensionsWindo DiaFileEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)SRD-677XF7 (SRD-67C Jy 19798(7 Service-oir Font. AmrCostI Clhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsE3 SSHEl DatadogPlatform Team200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=PriorityStor..% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..Assignee3 Steliyan...i2 Stellyan .-.Anellya A...& Unassign...DEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEActionsStatusINDEYVINDEV vREADY FOR DEV <READY FOR DEVSTurn on microphone (88 + d)9:55 AM | Daily - Platform(ablDaily - Platform - 10 m left100% [8• Tue 14 Apr 9:55:57ChatCE: DeperDependolD TasksInsights & Coachin• DelD ux.© 2Xz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failing5 days agcRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons8•• Far DepeGroup: QueriesDEPLOY &leature fiag releaseDeployed12006515 * *00=AI Review - Q1CTNMEUANANORCORDeployed_19982 |1 0 •*=•=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 11 ****=Missed deletion of stale0 -400R3 409009(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaNikolay YankovSteliyan GeorgievNikolay lvanov#Lukas Kovalik10:33...
|
NULL
|
|
8611
|
166
|
5
|
2026-04-14T06:55:55.704139+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149755704_m1.jpg...
|
Dia
|
Work: Meet - Daily - Pla…
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
6
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Zoom in
Open in new window
Enter Full Screen
Stefka Stoyanova
Nikolay Yankov
Steliyan Georgiev
Nikolay Ivanov
Lukas Kovalik
Others might see more of your background. Click to view your full video.
9:55
AM
Daily - Platform
Daily - Platform
Audio settings
Turn on microphone
Video settings
Turn off camera
Nikolay Yankov is presenting
Send a reaction
Turn on captions
Raise hand (ctrl + ⌘ + h)
More options
Leave call
Meeting details
Chat with everyone
Meeting tools
Your microphone is off.
meet.google.com / Daily - Platform...
|
[{"role":"AXHeading","text" [{"role":"AXHeading","text":"Nikolay Yankov (Presenting)","depth":13,"bounds":{"left":0.045833334,"top":0.11,"width":0.125,"height":0.022222223},"role_description":"heading"},{"role":"AXStaticText","text":"Nikolay Yankov (Presenting)","depth":14,"bounds":{"left":0.045833334,"top":0.11111111,"width":0.125,"height":0.02},"role_description":"text"},{"role":"AXPopUpButton","text":"People","depth":15,"bounds":{"left":0.8826389,"top":0.09888889,"width":0.04097222,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"6","depth":18,"bounds":{"left":0.91041666,"top":0.11,"width":0.0048611113,"height":0.016666668},"role_description":"text"},{"role":"AXPopUpButton","text":"Take notes with Gemini","depth":15,"bounds":{"left":0.9291667,"top":0.09888889,"width":0.025,"height":0.04},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Take notes with Gemini","depth":19,"bounds":{"left":0.93194443,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXStaticText","text":"Gemini","depth":19,"bounds":{"left":0.9625,"top":0.11,"width":0.022222223,"height":0.016666668},"role_description":"text"},{"role":"AXButton","text":"Gemini","depth":18,"bounds":{"left":0.9604167,"top":0.1,"width":0.023611112,"height":0.037777778},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Zoom in","depth":14,"bounds":{"left":0.6201389,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Open in new window","depth":15,"bounds":{"left":0.65347224,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Enter Full Screen","depth":15,"bounds":{"left":0.68680555,"top":0.7922222,"width":0.027777778,"height":0.044444446},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Stefka Stoyanova","depth":17,"bounds":{"left":0.7430556,"top":0.36666667,"width":0.08888889,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Nikolay Yankov","depth":17,"bounds":{"left":0.87222224,"top":0.36666667,"width":0.077083334,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Steliyan Georgiev","depth":17,"bounds":{"left":0.7430556,"top":0.6188889,"width":0.09097222,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Nikolay Ivanov","depth":17,"bounds":{"left":0.87222224,"top":0.6188889,"width":0.075,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"Lukas Kovalik","depth":17,"bounds":{"left":0.7423611,"top":0.8711111,"width":0.06944445,"height":0.022222223},"role_description":"text"},{"role":"AXButton","text":"Others might see more of your background. Click to view your full video.","depth":14,"bounds":{"left":0.9618056,"top":0.86777776,"width":0.018055556,"height":0.028888889},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"9:55","depth":10,"bounds":{"left":0.020833334,"top":0.93777776,"width":0.022916667,"height":0.022222223},"role_description":"text"},{"role":"AXStaticText","text":"AM","depth":10,"bounds":{"left":0.046527777,"top":0.93777776,"width":0.018055556,"height":0.022222223},"role_description":"text"},{"role":"AXHeading","text":"Daily - Platform","depth":10,"bounds":{"left":0.08194444,"top":0.90444446,"width":0.08125,"height":0.08888889},"role_description":"heading"},{"role":"AXStaticText","text":"Daily - Platform","depth":13,"bounds":{"left":0.08194444,"top":0.93777776,"width":0.08125,"height":0.022222223},"role_description":"text"},{"role":"AXPopUpButton","text":"Audio settings","depth":14,"bounds":{"left":0.30416667,"top":0.9222222,"width":0.06111111,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn on microphone","depth":14,"bounds":{"left":0.33194444,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":true,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"Video settings","depth":14,"bounds":{"left":0.37083334,"top":0.9222222,"width":0.06111111,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn off camera","depth":14,"bounds":{"left":0.3986111,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Nikolay Yankov is presenting","depth":12,"bounds":{"left":0.4375,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXCheckBox","text":"Send a reaction","depth":12,"bounds":{"left":0.48194444,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"toggle button","subrole":"AXToggleButton","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Turn on captions","depth":11,"bounds":{"left":0.5263889,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXCheckBox","text":"Raise hand (ctrl + ⌘ + h)","depth":11,"bounds":{"left":0.5708333,"top":0.9222222,"width":0.03888889,"height":0.053333335},"role_description":"toggle button","subrole":"AXToggleButton","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXPopUpButton","text":"More options","depth":14,"bounds":{"left":0.61527777,"top":0.9222222,"width":0.025,"height":0.053333335},"role_description":"pop up button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Leave call","depth":11,"bounds":{"left":0.6458333,"top":0.9222222,"width":0.05,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Meeting details","depth":12,"bounds":{"left":0.8875,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Chat with everyone","depth":12,"bounds":{"left":0.92083335,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Meeting tools","depth":12,"bounds":{"left":0.95416665,"top":0.9222222,"width":0.033333335,"height":0.053333335},"role_description":"button","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXStaticText","text":"Your microphone is off.","depth":4,"bounds":{"left":0.004166667,"top":0.08111111,"width":0.00069444446,"height":0.0011111111},"role_description":"text"},{"role":"AXTextArea","text":"meet.google.com / Daily - Platform","depth":5,"bounds":{"left":0.11111111,"top":0.04111111,"width":0.21597221,"height":0.033333335},"automation_id":"navigationBarAssistantBarTextField","value":"meet.google.com / Daily - Platform","role_description":"text entry area","is_focused":false}]...
|
6194137157482527795
|
6113912317546658616
|
click
|
accessibility
|
NULL
|
Nikolay Yankov (Presenting)
Nikolay Yankov (Presen Nikolay Yankov (Presenting)
Nikolay Yankov (Presenting)
People
6
Take notes with Gemini
Take notes with Gemini
Gemini
Gemini
Zoom in
Open in new window
Enter Full Screen
Stefka Stoyanova
Nikolay Yankov
Steliyan Georgiev
Nikolay Ivanov
Lukas Kovalik
Others might see more of your background. Click to view your full video.
9:55
AM
Daily - Platform
Daily - Platform
Audio settings
Turn on microphone
Video settings
Turn off camera
Nikolay Yankov is presenting
Send a reaction
Turn on captions
Raise hand (ctrl + ⌘ + h)
More options
Leave call
Meeting details
Chat with everyone
Meeting tools
Your microphone is off.
meet.google.com / Daily - Platform...
|
NULL
|
|
8607
|
166
|
4
|
2026-04-14T06:55:41.411577+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149741411_m1.jpg...
|
Firefox
|
Jiminny — Work
|
1
|
app.staging.jiminny.com/ai-reports
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to...
|
[{"role":"AXRadioButton","text [{"role":"AXRadioButton","text":"JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXRadioButton","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXRadioButton","text":"Jiminny","depth":4,"help_text":"","role_description":"tab","subrole":"AXTabButton","is_enabled":true,"is_focused":false,"is_selected":true},{"role":"AXStaticText","text":"Jiminny","depth":5,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Close tab","depth":5,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Tab","depth":4,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Customize sidebar","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Close Google Gemini (⌃X)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Tabs from other devices","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open history (⇧⌘H)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXCheckBox","text":"Open bookmarks (⌘B)","depth":6,"help_text":"","role_description":"toggle button","subrole":"AXToggle","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"AI Chat settings","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Close","depth":7,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"WORK, Google Account: lukas.kovalik@jiminny.com","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXButton","text":"Main menu","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"New Chat","depth":12,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Open menu for conversation actions.","depth":12,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false,"is_expanded":false},{"role":"AXHeading","text":"Conversation with Gemini","depth":15,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Conversation with Gemini","depth":16,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\\nExpected: {gt_label}\\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Expand","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"This Python script defines the","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator","depth":25,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Core Functionality","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Core Functionality","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Data Collection (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"collect_test_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"* Supports fetching raw samples from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Elasticsearch (ES)","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or loading them from","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"local JSON files","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Implements a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"replayability check","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":": It probes the live API endpoint to ensure the activity exists before including it in the dataset.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Filters data based on request body size, language, and the presence of an original response.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Ground Truth Management:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Saves inputs and \"gold standard\" responses to a local directory structure (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"eval_output/ai_activity_type/...","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":").","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Features an optional mode to populate ground truth using a high-order LLM (e.g.,","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"gpt-5_2","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") if the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"POPULATE_GROUND_TRUTH_LLM","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"flag is set.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Evaluation Logic (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"evaluate_against_ground_truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Compares the model's predicted","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ai_activity_type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"against either the original ES response or the LLM-generated ground truth.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Assigns a","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"quality_score","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"of","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"1.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a match and","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"0.0","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for a mismatch or request failure.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Key Configuration Constants","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Key Configuration Constants","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Constant","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"ENDPOINT_PATH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"COMPARE_TO_GROUND_TRUTH","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"SKIP_UNREPLAYABLE_ON_COLLECT","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"REPLAYABILITY_CHECK_TIMEOUT_SECONDS","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Purpose","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The specific API route being tested:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"call/ai-activity-type","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Determines if the baseline is the original data (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"es","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") or an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"llm","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"If","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"True","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", discards records where the API returns \"Cannot find activity.\"","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Limits the probe request to 20 seconds.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy table","depth":24,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"Execution Workflow","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Execution Workflow","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Parsing:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script uses","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"create_parser","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to handle CLI arguments like","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--data-set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--populate-ground-truth-llm","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"--compare-to-ground-truth","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Initialization:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sets up directory paths for inputs and ground truths based on the selected dataset name.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Run:","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Executes the evaluation via","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.run(run_evaluation(...))","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", which orchestrates the collection, endpoint calling, and result scoring phases.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Copy prompt","depth":21,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Edit","depth":21,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXHeading","text":"You said are there soem issues with this PR","depth":21,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"You said","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"are there soem issues with this PR","depth":23,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXButton","text":"Listen","depth":22,"help_text":"","role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXButton","text":"Show more options","depth":20,"role_description":"button","subrole":"AXUnknown","is_enabled":true,"is_focused":false,"is_selected":false},{"role":"AXStaticText","text":"Show thinking","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Gemini said","depth":20,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Gemini said","depth":21,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Here is a breakdown of the most notable issues, categorized by impact:","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Performance & Async Anti-Patterns","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Performance & Async Anti-Patterns","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Sequential Requests Defeating Async (Lines 86-106):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"_collect_from_es","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", the code loops through","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"raw_samples","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"and","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"await","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"s a","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"session.post","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"for asynchronous concurrency.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Collect these coroutines into a list and use","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.gather()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(ideally wrapped in an","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.Semaphore","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to avoid overwhelming the endpoint) to run the replayability probes concurrently.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Standard synchronous file operations (","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"with open(...)","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.dump","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":",","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"json.load","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") are used inside","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"async def","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"functions. In Python, this blocks the entire event loop while the disk is reading/writing.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Use an asynchronous file I/O library like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiofiles","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":", or run the blocking I/O operations in a separate thread using","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"asyncio.to_thread()","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0013888889,"top":0.08111111,"width":0.0027777778,"height":0.023333333},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Architecture & State Management","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Architecture & State Management","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Mutating Class-Level Variables (Lines 323-328):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"The script parses CLI arguments and directly mutates the class attributes (e.g.,","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"AiActivityTypeEvaluator.DATA_SET = args.data_set","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Pass these arguments into the","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"__init__","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"constructor of the instance (","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"self.data_set = args.data_set","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":") rather than modifying the class definition globally.","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXHeading","text":"Code Quality & Edge Cases","depth":23,"help_text":"","role_description":"heading","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Code Quality & Edge Cases","depth":24,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Broad Exception Catching (Line 107):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Using a bare","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except Exception:","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"during the replayability probe is dangerous. It will catch","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"aiohttp.ClientError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"(which is intended), but it will also silently swallow","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"TimeoutError","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"or even memory errors, making debugging incredibly difficult if the endpoint goes down.","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Fix:","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Scope the exception to specific expected errors, like","depth":28,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"except (aiohttp.ClientError, asyncio.TimeoutError):","depth":29,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":".","depth":28,"bounds":{"left":0.0,"top":1.0,"width":0.0027777778,"height":-0.04111111},"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"Inefficient Size Calculation (Line 95):","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"len(json.dumps(request_body)) < min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"requires serializing the entire dictionary to a string","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"just","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"to check its length. Since","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"min_size","depth":27,"help_text":"","role_description":"text","subrole":"AXUnknown"},{"role":"AXStaticText","text":"defaults to","depth":26,"help_text":"","role_description":"text","subrole":"AXUnknown"}]...
|
7596023454092539860
|
6426478565827326231
|
click
|
accessibility
|
NULL
|
JY-20543 add AJ reports User pilot tracking by Lak JY-20543 add AJ reports User pilot tracking by LakyLak · Pull Request #11932 · jiminny/app
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Platform Sprint 1 Q2 - Platform Team - Scrum Board - Jira
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jy 19798 evaluation for ai activity types by nikolaybiaivanov · Pull Request #468 · jiminny/prophet
Jiminny
Jiminny
Close tab
New Tab
Customize sidebar
Close Google Gemini (⌃X)
Tabs from other devices
Open history (⇧⌘H)
Open bookmarks (⌘B)
AI Chat settings
Close
WORK, Google Account: [EMAIL]
Main menu
New Chat
Open menu for conversation actions.
Conversation with Gemini
Conversation with Gemini
Copy prompt
You said I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected. Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
You said
I’m on page “<tabTitle>Jy 19798 evaluation for ai activity types by nikol</tabTitle>” with “<selection>@@ -0,0 +1,330 @@1+#!/usr/bin/env python32+import asyncio3+import json4+import os5+from pathlib import Path6+from typing import Any, Dict, List7+8+import aiohttp9+10+from src.component.call_processing.schemes.ai_activity_type_request import AiActivityTypeRequest11+from src.component.call_processing.schemes.ai_activity_type_response import AiActivityTypeResponse12+from src.eval.abstract_evaluator import AbstractEvaluator13+from src.eval.collectors import collect_test_set as es_collect_test_set14+from src.eval.common_evaluator_runner import create_parser, run_evaluation15+from src.eval.endpoints import run_endpoint as run_endpoint_http16+from src.eval.models import EvaluationResult17+18+19+class AiActivityTypeEvaluator(AbstractEvaluator):20+ENDPOINT_PATH = "call/ai-activity-type"21+REQUEST_MODEL = AiActivityTypeRequest22+RESPONSE_MODEL = AiActivityTypeResponse23+OUTPUT_SUBFOLDER = "ai_activity_type"24+USES_LLM_EVALUATION = False25+26+READS_FROM_DATASET_FOLDER = True27+28+DATA_SET: str = "initial"29+30+MINIMUM_REQUEST_BODY_SIZE: int = 031+32+SKIP_UNREPLAYABLE_ON_COLLECT: bool = True33+REPLAYABILITY_CHECK_TIMEOUT_SECONDS: float = 20.034+35+POPULATE_GROUND_TRUTH_LLM: bool = False36+GROUND_TRUTH_LLM_ENDPOINT: str = "OPENAI:gpt-5_2"37+38+COMPARE_TO_GROUND_TRUTH: str = "es"39+40+def __init__(self, **kwargs: Any) -> None:41+super().__init__(**kwargs)42+_repo_root = Path(__file__).parents[2]43+_datasets_root = _repo_root / "eval_output" / self.OUTPUT_SUBFOLDER / "eval_datasets"44+_dataset_dir = _datasets_root / self.__class__.DATA_SET45+self._dataset_dir = _dataset_dir46+self._inputs_dir = _dataset_dir / "input"47+self._ground_truths_dir = _dataset_dir / "ground_truth"48+49+async def collect_test_set(self) -> List[Dict[str, Any]]:50+if self.es_host_name:51+return await self._collect_from_es()52+return await self._collect_from_files()53+54+async def _collect_from_es(self) -> List[Dict[str, Any]]:55+if self._inputs_dir.exists() or self._ground_truths_dir.exists():56+raise FileExistsError(57+f"Dataset folder already exists: {self._inputs_dir.parent}. "58+"Choose a different --data-set name or remove the folder manually."59+ )60+61+raw_samples = await es_collect_test_set(62+es_host_name=self.es_host_name,63+es_port=self.es_port,64+endpoint_path=self.ENDPOINT_PATH,65+limit=self.limit,66+language_filter=self.language_filter,67+min_prompt_length=self.min_prompt_length,68+is_json_response=self.IS_JSON_RESPONSE,69+exclude_languages=self.exclude_languages,70+ )71+72+self._inputs_dir.mkdir(parents=True, exist_ok=True)73+self._ground_truths_dir.mkdir(parents=True, exist_ok=True)74+75+min_size = self.MINIMUM_REQUEST_BODY_SIZE76+samples: List[Dict[str, Any]] = []77+skipped_no_response = 078+skipped_too_small = 079+skipped_unreplayable = 080+skipped_probe_errors = 081+82+endpoint_url = f"{self.api_host}/{self.ENDPOINT_PATH}"83+84+timeout = aiohttp.ClientTimeout(total=self.REPLAYABILITY_CHECK_TIMEOUT_SECONDS)85+async with aiohttp.ClientSession(timeout=timeout) as session:86+for sample in raw_samples:87+es_id = sample["id"]88+request_body = sample["request_body"]89+original_response = sample.get("original_response")90+91+if not original_response:92+skipped_no_response += 193+continue94+95+if min_size and len(json.dumps(request_body)) < min_size:96+skipped_too_small += 197+continue98+99+if self.SKIP_UNREPLAYABLE_ON_COLLECT:100+try:101+async with session.post(endpoint_url, json=request_body) as resp:102+if resp.status != 200:103+body_text = await resp.text()104+if "Cannot find activity" in body_text:105+skipped_unreplayable += 1106+continue107+except Exception:108+skipped_probe_errors += 1109+110+with open(self._inputs_dir / f"{es_id}.json", "w") as f:111+json.dump(request_body, f, indent=2)112+113+with open(self._ground_truths_dir / f"gt_{es_id}.json", "w") as f:114+json.dump(original_response, f, indent=2)115+116+samples.append(117+ {118+"id": es_id,119+"request_body": request_body,120+"original_response": original_response,121+"ground_truth": original_response,122+ }123+ )124+125+print(f"Wrote {len(samples)} samples to {self._inputs_dir.parent}")126+if skipped_no_response:127+print(f"Skipped {skipped_no_response} records with no response body")128+if skipped_too_small:129+print(130+f"Skipped {skipped_too_small} records whose request body was smaller "131+f"than {min_size} characters"132+ )133+if skipped_unreplayable:134+print(f"Skipped {skipped_unreplayable} records that are not replayable (missing activity in ES)")135+if skipped_probe_errors:136+print(f"Warning: replayability probe errored for {skipped_probe_errors} records (kept those records)")137+138+if self.__class__.POPULATE_GROUND_TRUTH_LLM:139+label_results = await run_endpoint_http(140+test_set=samples,141+endpoint_url=endpoint_url,142+request_model=self.REQUEST_MODEL,143+force_llm_endpoint=self.__class__.GROUND_TRUTH_LLM_ENDPOINT,144+is_streaming_response=self.IS_STREAMING_RESPONSE,145+timeout_seconds=300.0,146+parallel_requests=self.parallel_requests,147+return_prompt_in_run=self.RETURN_PROMPT_IN_RUN,148+ )149+150+for i, r in enumerate(label_results):151+forced = self.__class__.GROUND_TRUTH_LLM_ENDPOINT152+model = forced.split(":", 1)[1] if ":" in forced else forced153+samples[i]["ground_truth_llm_model"] = model154+155+if not r.get("success", True):156+samples[i]["ground_truth_llm"] = None157+samples[i]["ground_truth_llm_reasoning"] = None158+samples[i]["ground_truth_llm_error_type"] = r.get("error_type")159+samples[i]["ground_truth_llm_error_message"] = r.get("error_message")160+continue161+162+resp = r.get("new_response") or {}163+samples[i]["ground_truth_llm"] = resp.get("ai_activity_type")164+samples[i]["ground_truth_llm_reasoning"] = resp.get("ai_activity_type_reasoning")165+166+return samples167+168+async def _collect_from_files(self) -> List[Dict[str, Any]]:169+if not self._inputs_dir.exists():170+raise FileNotFoundError(171+f"Dataset folder not found: {self._inputs_dir}. "172+"Run --steps collect first or check your --data-set name."173+ )174+175+samples: List[Dict[str, Any]] = []176+for filename in sorted(os.listdir(self._inputs_dir)):177+input_path = self._inputs_dir / filename178+gt_path = self._ground_truths_dir / f"gt_{filename}"179+180+if not gt_path.exists():181+print(f"Warning: ground truth not found for {filename}, skipping")182+continue183+184+with open(input_path, "r") as f:185+request_body: Dict[str, Any] = json.load(f)186+187+with open(gt_path, "r") as f:188+ground_truth: Dict[str, Any] = json.load(f)189+190+samples.append(191+ {192+"id": Path(filename).stem,193+"request_body": request_body,194+"original_response": None,195+"ground_truth": ground_truth,196+ }197+ )198+199+print(f"Loaded {len(samples)} samples from {self._inputs_dir.parent}")200+return samples201+202+async def evaluate_against_ground_truth(self, results: List[Dict[str, Any]]) -> List[EvaluationResult]:203+evaluation_results: List[EvaluationResult] = []204+compare_to = self.__class__.COMPARE_TO_GROUND_TRUTH205+206+for res in results:207+req_id = res["request_id"]208+is_success = res.get("success", True)209+210+if not is_success:211+evaluation_results.append(212+EvaluationResult(213+request_id=req_id,214+request_body=res["request_body"],215+new_response=res.get("new_response"),216+valid_schema=False,217+quality_score=0.0,218+quality_feedback=f"Request failed: {res.get('error_type')} - {res.get('error_message')}",219+response_time=res.get("response_time"),220+success=False,221+error_type=res.get("error_type"),222+error_message=res.get("error_message"),223+ground_truth_match=False,224+ )225+ )226+continue227+228+pred = (res.get("new_response") or {}).get("ai_activity_type")229+230+if compare_to == "llm":231+if "ground_truth_llm" not in res:232+evaluation_results.append(233+EvaluationResult(234+request_id=req_id,235+request_body=res["request_body"],236+new_response=res.get("new_response"),237+valid_schema=res.get("valid_schema", True),238+quality_score=0.0,239+quality_feedback=(240+"Missing ground_truth_llm in run results. Ensure collect step populated it and "241+"endpoints.py passes it through."242+ ),243+response_time=res.get("response_time"),244+success=True,245+ground_truth_match=False,246+ )247+ )248+continue249+gt_label = res.get("ground_truth_llm")250+else:251+gt_path = self._ground_truths_dir / f"gt_{req_id}.json"252+if not gt_path.exists():253+evaluation_results.append(254+EvaluationResult(255+request_id=req_id,256+request_body=res["request_body"],257+new_response=res.get("new_response"),258+valid_schema=res.get("valid_schema", True),259+quality_score=0.0,260+quality_feedback=f"Ground truth file not found: {gt_path}",261+response_time=res.get("response_time"),262+success=True,263+ground_truth_match=False,264+ )265+ )266+continue267+268+with open(gt_path, "r") as f:269+ground_truth_dict: Dict[str, Any] = json.load(f)270+271+gt_label = (ground_truth_dict or {}).get("ai_activity_type")272+273+matched = pred == gt_label274+275+evaluation_results.append(276+EvaluationResult(277+request_id=req_id,278+request_body=res["request_body"],279+new_response=res.get("new_response"),280+valid_schema=res.get("valid_schema", True),281+quality_score=1.0 if matched else 0.0,282+quality_feedback="PASS" if matched else f"FAIL\nExpected: {gt_label}\nGot: {pred}",283+response_time=res.get("response_time"),284+success=True,285+ground_truth_match=matched,286+ )287+ )288+289+return evaluation_results290+291+292+if __name__ == "__main__":293+parser = create_parser(AiActivityTypeEvaluator)294+parser.add_argument(295+"--data-set",296+type=str,297+default="initial",298+dest="data_set",299+help="Dataset folder under eval_output/ai_activity_type/eval_datasets/ (default: initial)",300+ )301+302+parser.add_argument(303+"--populate-ground-truth-llm",304+action="store_true",305+dest="populate_ground_truth_llm",306+ )307+parser.add_argument(308+"--ground-truth-llm-endpoint",309+type=str,310+default="OPENAI:gpt-5_2",311+dest="ground_truth_llm_endpoint",312+ )313+314+parser.add_argument(315+"--compare-to-ground-truth",316+type=str,317+choices=["es", "llm"],318+default="es",319+dest="compare_to_ground_truth",320+ )321+322+args = parser.parse_args()323+AiActivityTypeEvaluator.DATA_SET = args.data_set324+325+AiActivityTypeEvaluator.POPULATE_GROUND_TRUTH_LLM = bool(args.populate_ground_truth_llm)326+AiActivityTypeEvaluator.GROUND_TRUTH_LLM_ENDPOINT = args.ground_truth_llm_endpoint327+328+AiActivityTypeEvaluator.COMPARE_TO_GROUND_TRUTH = args.compare_to_ground_truth329+330+asyncio.run(run_evaluation(AiActivityTypeEvaluator, args))</selection>” selected.
Please summarize the selection using precise and concise language. Use headers and bulleted lists in the summary, to make it scannable. Maintain the meaning and factual accuracy.
Expand
Listen
Show more options
Gemini said
Gemini said
This Python script defines the
AiActivityTypeEvaluator
class, a specialized tool for evaluating AI-driven activity type classification. It extends a base evaluation framework to handle data collection, ground-truth comparison, and performance scoring.
Core Functionality
Core Functionality
Data Collection (
collect_test_set
):
* Supports fetching raw samples from
Elasticsearch (ES)
or loading them from
local JSON files
.
Implements a
replayability check
: It probes the live API endpoint to ensure the activity exists before including it in the dataset.
Filters data based on request body size, language, and the presence of an original response.
Ground Truth Management:
Saves inputs and "gold standard" responses to a local directory structure (
eval_output/ai_activity_type/...
).
Features an optional mode to populate ground truth using a high-order LLM (e.g.,
gpt-5_2
) if the
POPULATE_GROUND_TRUTH_LLM
flag is set.
Evaluation Logic (
evaluate_against_ground_truth
):
Compares the model's predicted
ai_activity_type
against either the original ES response or the LLM-generated ground truth.
Assigns a
quality_score
of
1.0
for a match and
0.0
for a mismatch or request failure.
Key Configuration Constants
Key Configuration Constants
Constant
Purpose
ENDPOINT_PATH
The specific API route being tested:
call/ai-activity-type
.
COMPARE_TO_GROUND_TRUTH
Determines if the baseline is the original data (
es
) or an
llm
.
SKIP_UNREPLAYABLE_ON_COLLECT
If
True
, discards records where the API returns "Cannot find activity."
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Limits the probe request to 20 seconds.
Constant
ENDPOINT_PATH
COMPARE_TO_GROUND_TRUTH
SKIP_UNREPLAYABLE_ON_COLLECT
REPLAYABILITY_CHECK_TIMEOUT_SECONDS
Purpose
The specific API route being tested:
call/ai-activity-type
.
Determines if the baseline is the original data (
es
) or an
llm
.
If
True
, discards records where the API returns "Cannot find activity."
Limits the probe request to 20 seconds.
Copy table
Execution Workflow
Execution Workflow
Parsing:
The script uses
create_parser
to handle CLI arguments like
--data-set
,
--populate-ground-truth-llm
, and
--compare-to-ground-truth
.
Initialization:
Sets up directory paths for inputs and ground truths based on the selected dataset name.
Run:
Executes the evaluation via
asyncio.run(run_evaluation(...))
, which orchestrates the collection, endpoint calling, and result scoring phases.
Copy prompt
Edit
You said are there soem issues with this PR
You said
are there soem issues with this PR
Listen
Show more options
Show thinking
Gemini said
Gemini said
Yes, there are several issues and code smells in this Pull Request, ranging from performance bottlenecks to architectural anti-patterns.
Here is a breakdown of the most notable issues, categorized by impact:
Performance & Async Anti-Patterns
Performance & Async Anti-Patterns
Sequential Requests Defeating Async (Lines 86-106):
Inside
_collect_from_es
, the code loops through
raw_samples
and
await
s a
session.post
request for each sample one by one. This means the script waits for one HTTP request to finish before starting the next, completely defeating the purpose of using
aiohttp
for asynchronous concurrency.
Fix:
Collect these coroutines into a list and use
asyncio.gather()
(ideally wrapped in an
asyncio.Semaphore
to avoid overwhelming the endpoint) to run the replayability probes concurrently.
Blocking File I/O in Async Functions (Lines 110-114, 184-188, 268-269):
Standard synchronous file operations (
with open(...)
,
json.dump
,
json.load
) are used inside
async def
functions. In Python, this blocks the entire event loop while the disk is reading/writing.
Fix:
Use an asynchronous file I/O library like
aiofiles
, or run the blocking I/O operations in a separate thread using
asyncio.to_thread()
.
Architecture & State Management
Architecture & State Management
Mutating Class-Level Variables (Lines 323-328):
The script parses CLI arguments and directly mutates the class attributes (e.g.,
AiActivityTypeEvaluator.DATA_SET = args.data_set
). This is a major anti-pattern. If you ever try to run tests in parallel or create multiple instances of this evaluator in a single process, the state will bleed across them.
Fix:
Pass these arguments into the
__init__
constructor of the instance (
self.data_set = args.data_set
) rather than modifying the class definition globally.
Code Quality & Edge Cases
Code Quality & Edge Cases
Broad Exception Catching (Line 107):
Using a bare
except Exception:
during the replayability probe is dangerous. It will catch
aiohttp.ClientError
(which is intended), but it will also silently swallow
TimeoutError
or even memory errors, making debugging incredibly difficult if the endpoint goes down.
Fix:
Scope the exception to specific expected errors, like
except (aiohttp.ClientError, asyncio.TimeoutError):
.
Inefficient Size Calculation (Line 95):
len(json.dumps(request_body)) < min_size
requires serializing the entire dictionary to a string
just
to check its length. Since
min_size
defaults to...
|
8605
|
|
8605
|
166
|
3
|
2026-04-14T06:55:39.855574+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149739855_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. Amrhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=Priority% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..CostI ClActionsAssignee3 Steliyan...i2 Stellyan .-.Anellya A...& Unassign...DEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEStatusINDEYVINDEV vREADY FOR DEVREADY FOR DEVS9:55 AM | Daily - PlatformDaily - Platform - 10 m left8•CE: DepeDependol• Far DepeD Tasks2 Insights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY 8leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of staleaesese(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievValik10:158 • Tue 14 Apr 9:55:39ChatNikolay YankovNikolay Ivanov...
|
NULL
|
6844727409046715082
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. Amrhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=Priority% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byA JY-20458 CRUD reports with Ask Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..CostI ClActionsAssignee3 Steliyan...i2 Stellyan .-.Anellya A...& Unassign...DEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEStatusINDEYVINDEV vREADY FOR DEVREADY FOR DEVS9:55 AM | Daily - PlatformDaily - Platform - 10 m left8•CE: DepeDependol• Far DepeD Tasks2 Insights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY 8leature fiag releaseDeployed0 -20585 15 * ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 1 0 •***=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of staleaesese(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievValik10:158 • Tue 14 Apr 9:55:39ChatNikolay YankovNikolay Ivanov...
|
NULL
|
|
8603
|
166
|
2
|
2026-04-14T06:55:36.821435+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149736821_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. Amrhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909Sminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=Priority% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..CostI ClActionsAssignee3 Steliyan...i2 Stellyan .-.Anellya A...& Unassign...DEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEStatusINDEYVINDEV vREADY FOR DEVREADY FOR DEVS9:55 AM | Daily - PlatformDaily - Platform - 10 m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY 8leature fiag releaseDeployed12006515 * *00=AI Review - Q1CTNMEUANANORCORDeployed19982 1 @ •..•=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of staleaeeese(Pipedrive) DI › status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan Georgievvalik10:128 • Tue 14 Apr 9:55:36ChatNikolay YankovNikolay Ivanov...
|
NULL
|
6745712654514242901
|
NULL
|
visual_change
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. Amrhttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909Sminm0 ProjectsB AWSE3 SSHEl DatadogPlatform TeamQ Search board200$ Jy-19240 / [ JY-18900READY FOR DEV 1oohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-20564 1 П ****=Priority% JY-20574 [AI) Ensure PDF formatting is good= Medium= Medium% JY-20586 (QA) Manual Testing= Medium% JY-20626 Integration testing= MediumLinked work itemsis cloned byJY-20458 CRUD reports with Ask. Jiminnyis tested byJY-20649 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an.JY-20660 Non-creator cannot delete generated Ask Jiminny report from A…JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...JY-20655 Generated Ask Jiminny Report emall is sent to recipients0 JY-20654 Generated Ask Jiminny Reoort is visible onty to creator and shar...JY-20650 Generated Ask Jiminny Report PDF includes branding, report se..CostI ClActionsAssignee3 Steliyan...i2 Stellyan .-.Anellya A...& Unassign...DEPLOYEDDONEDONE MDONEDONE MDONEDONEWUnEStatusINDEYVINDEV vREADY FOR DEVREADY FOR DEVS9:55 AM | Daily - PlatformDaily - Platform - 10 m left8•CE: DepeDependol• Far DepeD TasksInsights & CoachinD DevD ux.©2|In Dev vXz Improve Story~ Details32 Stellyan GeorgievAsuntoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcD Al BookmarksGroup: QueriesDEPLOY 8leature fiag releaseDeployed12006515 * *00=AI Review - Q1CTNMEUANANORCORDeployed19982 1 @ •..•=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 ^ ****=|Missed deletion of staleaeeese(Pipedrive) DI › status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan Georgievvalik10:128 • Tue 14 Apr 9:55:36ChatNikolay YankovNikolay Ivanov...
|
8597
|
|
8597
|
166
|
1
|
2026-04-14T06:55:17.879190+00:00
|
/Users/lukas/.screenpipe/data/data/2026-04-14/1776 /Users/lukas/.screenpipe/data/data/2026-04-14/1776149717879_m1.jpg...
|
NULL
|
NULL
|
1
|
NULL
|
monitor_1
|
NULL
|
NULL
|
NULL
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)E:1Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI ClActions shttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 Projectslã AwsE3 SSHEl DatadogPlatform TeamQ Search board209$ Jy-19240 / [ JY-18900READY FOR DEV 1Subtasksoohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-205641 П ****:PriorityStor...JY-20574 [Ai) Ensure PDF formatting is good= Medium9 JY-20582 (BE+AI+Infra] Create new queue= Medium% JY-20586 [QA] Manual Testing= Medium% JY-20626 Integration testing= Mediumis cioned byA JY-20458 CRUD reports with Ask JiminnyJY-20849 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an….JY-20660 Non-creator cannot delete generated Ask Jiminny report from A..JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...• JY-20654 Generated Ask Jiminny Report is visible oniy to creator and shar...• Jy-20650 Generated Ask Jiminheport for melodes orene ng. loport se* Claude3 CircieciA, SentrylOCI-Hide doneBulk editView in searchSteneneAnellya A.KEAUTDURUEY8 Unassign...READY FOR DEVSDErLOTEUYDONEMDONEDONEWUntDONE MDONEM9:55 AM | Daily - Platform(nh|Daily - Platform - 10 m left8•CE: DepeDependol• Far DepeD Tasks2 Insights & CoachinD DevD ux.© 2In Dev vXz Improve Story~ DetailsAssignee32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcUrEhD Al BookmarksGroup: QueriesDEPLOY 8leature fiag reieaseDeployed0 -20585 15 + ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 |1 0 •**•=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 11 ****=Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik9:538 • Tue 14 Apr 9:55:17ChatNikolay YankovNikolay lvanov...
|
NULL
|
-8158254741196970098
|
NULL
|
click
|
ocr
|
NULL
|
DiaFilelEditViewTabsBookmarksHistoryExtensionsWind DiaFilelEditViewTabsBookmarksHistoryExtensionsWindowHelpmeet.google.com / Daily - PlatformNikolay Yankov (Presenting)E:1Inbox (2)PlactormSRD-677C Jy 19798(7 Service-oir Font. AmrCostI ClActions shttps://jiminny.atlassian.net/jira/software/c/projects/JY/boards/377selectedissue=JY-18909• Siminm0 Projectslã AwsE3 SSHEl DatadogPlatform TeamQ Search board209$ Jy-19240 / [ JY-18900READY FOR DEV 1Subtasksoohesnetixwirtexceed Fontawesomepackage limits(MAINTENANCISReady for Dev8-205641 П ****:PriorityStor...JY-20574 [Ai) Ensure PDF formatting is good= Medium9 JY-20582 (BE+AI+Infra] Create new queue= Medium% JY-20586 [QA] Manual Testing= Medium% JY-20626 Integration testing= Mediumis cioned byA JY-20458 CRUD reports with Ask JiminnyJY-20849 Generate Ask Jiminny Report and display it in AI ReportsJY-20656 Expired Ask Jiminny report template is automatically disabled an….JY-20660 Non-creator cannot delete generated Ask Jiminny report from A..JY-20651 Generated Ask Jiminny Report PDF displays Jiminny logo when n...• JY-20654 Generated Ask Jiminny Report is visible oniy to creator and shar...• Jy-20650 Generated Ask Jiminheport for melodes orene ng. loport se* Claude3 CircieciA, SentrylOCI-Hide doneBulk editView in searchSteneneAnellya A.KEAUTDURUEY8 Unassign...READY FOR DEVSDErLOTEUYDONEMDONEDONEWUntDONE MDONEM9:55 AM | Daily - Platform(nh|Daily - Platform - 10 m left8•CE: DepeDependol• Far DepeD Tasks2 Insights & CoachinD DevD ux.© 2In Dev vXz Improve Story~ DetailsAssignee32 Stellyan GeorgievAsentoeReporter2 Galya DimitrovaUeyele encne@ Open with VS Code|3 branches32 commits1 pull request1 build failingRelessesProduction2 See all deploymentsComponentsPlatformSub-ProductAoorcons5 days agcUrEhD Al BookmarksGroup: QueriesDEPLOY 8leature fiag reieaseDeployed0 -20585 15 + ****=AI Review - Q1CTNMEUANANORCORDeployed_19982 |1 0 •**•=No such column "Type' onentty 'Event"PLATTORM STABTUTTYDeployed9 -20384 05 11 ****=Missed deletion of stale408099(Pipedrive] DI > status notlabeled, date data missingREDUCE CHURNDeployed0 -20614 05 П ****=:Stefka StoyanovaSteliyan GeorgievKovalik9:538 • Tue 14 Apr 9:55:17ChatNikolay YankovNikolay lvanov...
|
NULL
|