topic-modelling

Requirements

  • bigartm, nltk
  • Python 3.6

How to train

./train.sh # Training parameters are set inside this script

How to test

./predict_pipeline.sh ./test.txt
python skill.py /tmp/pred-tm-83F18394

Debug info

TopTokens (TopTokens):
#1: party(0.0085) law(0.0066) president(0.0055) court(0.0055) political(0.0053) minister(0.0047) act(0.0044) said(0.0032) rights(0.003) union(0.0028)
#2: india(0.011) al(0.009) indian(0.0068) russian(0.0047) language(0.0044) temple(0.0041) bc(0.0039) greek(0.0038) king(0.0038) pakistan(0.0034)
#3: army(0.014) air(0.01) military(0.0093) force(0.0088) battle(0.0082) forces(0.0069) ship(0.0066) navy(0.006) division(0.0051) command(0.005)
#4: award(0.012) art(0.0083) director(0.0073) television(0.0072) book(0.0071) radio(0.0069) awards(0.0069) tv(0.0068) published(0.0067) show(0.0066)
#5: league(0.022) club(0.016) game(0.015) football(0.014) player(0.012) cup(0.011) games(0.01) round(0.0081) championship(0.0077) win(0.0074)
#6: species(0.01) water(0.0038) white(0.0033) common(0.0032) genus(0.0029) often(0.0027) red(0.0026) black(0.0025) plant(0.0024) food(0.0022)
#7: power(0.0064) engine(0.0058) design(0.0052) air(0.0051) system(0.0045) car(0.0045) aircraft(0.0044) speed(0.0042) type(0.0035) model(0.0035)
#8: district(0.024) population(0.017) students(0.011) education(0.011) schools(0.0096) election(0.0083) town(0.0083) census(0.0076) community(0.0075) township(0.007)
#9: px(0.019) race(0.01) championships(0.0086) men(0.0082) racing(0.0081) women(0.0079) event(0.0077) medal(0.006) points(0.0057) rank(0.0056)
#10: research(0.0068) system(0.0053) data(0.004) science(0.0039) information(0.0038) example(0.0035) technology(0.0032) theory(0.0032) systems(0.003) development(0.003)
#11: film(0.013) episode(0.0054) man(0.0047) story(0.0037) character(0.0037) game(0.003) role(0.0027) show(0.0027) movie(0.0025) love(0.0025)
#12: french(0.012) german(0.011) france(0.0088) la(0.0084) paris(0.0066) germany(0.0065) italian(0.006) saint(0.0054) le(0.0049) italy(0.0046)
#13: business(0.0089) bar(0.0085) million(0.0079) text(0.0067) services(0.0061) market(0.0058) bank(0.0053) companies(0.0045) till(0.0044) industry(0.0043)
#14: river(0.012) station(0.011) park(0.01) road(0.0099) building(0.0078) street(0.0069) railway(0.0067) island(0.0065) route(0.0062) lake(0.0062)
#15: church(0.017) london(0.011) england(0.0085) william(0.0073) king(0.0063) sir(0.006) son(0.0055) royal(0.0055) henry(0.0048) george(0.0047)
#16: james(0.006) canadian(0.005) canada(0.0047) george(0.0047) david(0.0046) texas(0.0046) smith(0.0046) robert(0.0041) william(0.004) chicago(0.004)
#17: san(0.011) la(0.011) japan(0.01) china(0.01) chinese(0.0093) japanese(0.0082) spanish(0.0081) el(0.0078) mexico(0.0068) del(0.0055)
#18: album(0.02) song(0.015) band(0.014) you(0.0085) chart(0.0067) songs(0.0065) records(0.0065) track(0.0059) rock(0.0057) guitar(0.0055)
ThetaSnippet (ThetaSnippet)
ItemID=990: 0.20077 0.00000 0.00030 0.00001 0.00000 0.04273 0.00757 0.49958 0.00000 0.00000 0.00000 0.00000 0.00000 0.01382 0.00002 0.00014 0.23506 0.00000
ItemID=991: 0.10054 0.00000 0.00000 0.00000 0.00000 0.06029 0.00104 0.52863 0.00000 0.00000 0.00000 0.00291 0.00116 0.00000 0.03528 0.00012 0.27001 0.00000
ItemID=992: 0.00013 0.00000 0.00802 0.05240 0.00003 0.00002 0.91485 0.00000 0.00415 0.00007 0.00000 0.00066 0.01813 0.00151 0.00000 0.00000 0.00000 0.00003
ItemID=993: 0.01469 0.09822 0.00018 0.00000 0.52146 0.00000 0.00236 0.00000 0.00127 0.00000 0.00000 0.00000 0.02197 0.00001 0.01722 0.28666 0.03123 0.00472
ItemID=994: 0.00003 0.17517 0.00000 0.00008 0.01605 0.00009 0.00246 0.10641 0.00177 0.33795 0.00001 0.00017 0.01877 0.32463 0.01641 0.00001 0.00000 0.00000
ItemID=995: 0.00463 0.00000 0.00000 0.00000 0.02883 0.01819 0.00000 0.66545 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00004 0.28285 0.00000
ItemID=996: 0.00000 0.00000 0.00000 0.00020 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00110 0.00000 0.00000 0.82002 0.17868 0.00000 0.00000
ItemID=997: 0.00000 0.00002 0.00000 0.00014 0.00000 0.00000 0.00000 0.17164 0.00000 0.00000 0.00000 0.00000 0.00000 0.04331 0.00003 0.00000 0.78486 0.00000
ItemID=998: 0.00000 0.00000 0.00000 0.00000 0.35416 0.00000 0.00000 0.00009 0.55727 0.00000 0.00000 0.00197 0.00000 0.00000 0.00028 0.05404 0.03219 0.00000
ItemID=999: 0.16169 0.00000 0.00000 0.00769 0.00000 0.00046 0.01242 0.35180 0.00000 0.00000 0.00000 0.00000 0.07021 0.00000 0.00000 0.00000 0.39572 0.00000
Saving model to wiki15+3topics.init.model... OK.
Saving model in readable format to wiki15+3topics.phi.txt... OK.
Generating predictions...
Perplexity      = 6581.42076
SparsityPhi     = 0.00609
SparsityTheta   = 0.00865
Writing model predictions into wiki15+3topics.theta.txt... OK.
OK.
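
Each ThetaSnippet row above gives one document (ItemID) followed by its weight on each of the 18 topics, in topic order. The following is a minimal sketch for reading such a row and picking its dominant topic. It assumes only the row format shown in this log; the helper names `parse_theta_row` and `dominant_topic` are hypothetical and are not part of this repository or of BigARTM.

```python
def parse_theta_row(line):
    """Parse 'ItemID=992: 0.00013 0.00000 ...' into (item_id, [weights])."""
    head, _, tail = line.partition(":")
    item_id = int(head.strip().split("=", 1)[1])
    weights = [float(x) for x in tail.split()]
    return item_id, weights


def dominant_topic(weights):
    """Return (1-based topic number, weight) of the largest entry."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return best + 1, weights[best]


# Row copied from the log above.
row = ("ItemID=992: 0.00013 0.00000 0.00802 0.05240 0.00003 0.00002 0.91485 "
       "0.00000 0.00415 0.00007 0.00000 0.00066 0.01813 0.00151 0.00000 "
       "0.00000 0.00000 0.00003")
item_id, weights = parse_theta_row(row)
print(item_id, dominant_topic(weights))
```

For this row the largest weight falls on topic #7, whose top tokens in the first listing are power, engine, design and aircraft.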






Perplexity      = 8657.41
SparsityPhi     = 0.823823
SparsityTheta   = 0.493242
================= Iteration 10 took 00:57:36.660
TopTokens (TopTokens):
politics: #1: party(0.013) president(0.0091) law(0.009) court(0.0074) political(0.0071) election(0.0071) minister(0.006) act(0.0056) council(0.0053) committee(0.0048)
nationalities: #2: india(0.022) al(0.019) indian(0.015) chinese(0.014) china(0.014) russian(0.011) bc(0.0088) language(0.0088) greek(0.0086) temple(0.0084)
military: #3: army(0.022) military(0.015) air(0.015) force(0.014) battle(0.013) forces(0.011) ship(0.0099) navy(0.0091) division(0.0078) command(0.0075)
media: #4: award(0.014) art(0.011) radio(0.0097) published(0.0094) awards(0.0093) director(0.009) book(0.0088) tv(0.0072) television(0.0072) books(0.0069)
games: #5: league(0.028) game(0.022) club(0.02) football(0.018) player(0.017) cup(0.015) games(0.014) round(0.011) win(0.0097) championship(0.0091)
biology: #6: species(0.019) water(0.012) white(0.0088) black(0.0078) red(0.0073) common(0.0062) food(0.0056) plant(0.0053) often(0.005) genus(0.005)
engineering: #7: power(0.013) design(0.011) air(0.011) airport(0.011) engine(0.01) system(0.0092) car(0.0086) aircraft(0.008) speed(0.0078) engineering(0.0075)
city: #8: district(0.035) population(0.025) students(0.02) education(0.019) schools(0.015) center(0.014) town(0.013) census(0.012) township(0.011) texas(0.011)
sports: #9: px(0.034) race(0.017) women(0.017) men(0.016) championships(0.016) event(0.016) racing(0.013) gold(0.013) medal(0.011) rank(0.011)
research: #10: system(0.0069) research(0.006) example(0.0059) different(0.0052) using(0.005) form(0.0048) often(0.0046) data(0.0046) information(0.0041) must(0.0039)
films: #11: film(0.013) you(0.008) man(0.0059) love(0.0055) show(0.0054) episode(0.0051) my(0.0046) live(0.0043) video(0.0041) me(0.0041)
countries: #12: la(0.035) french(0.029) german(0.026) france(0.022) germany(0.018) paris(0.015) italian(0.015) spanish(0.012) italy(0.012) saint(0.012)
business: #13: business(0.014) services(0.012) million(0.012) bar(0.011) development(0.0098) management(0.0098) text(0.0082) market(0.0077) bank(0.0073) industry(0.0068)
environment: #14: river(0.015) station(0.014) park(0.013) road(0.013) building(0.012) town(0.011) street(0.0092) island(0.0085) lake(0.0081) railway(0.0079)
noise: #15: church(0.023) william(0.015) london(0.014) england(0.013) king(0.011) james(0.011) george(0.011) thomas(0.0097) son(0.0095) henry(0.0087)
noise: #16: florida(0.001) johnson(0.001) jr(0.00097) chicago(0.00093) bob(0.0009) jackson(0.00079) smith(0.00076) davis(0.00076) frank(0.00075) canadian(0.00075)
noise: #17: el(0.0016) mexico(0.0012) japan(0.0012) hong(0.001) kong(0.00099) brazil(0.00086) juan(0.00073) josé(0.00069) philippines(0.00069) portuguese(0.00068)
music: #18: album(0.0075) song(0.0054) band(0.005) chart(0.0025) records(0.0024) songs(0.0024) track(0.0023) guitar(0.002) recorded(0.0019) vocals(0.0018)
ThetaSnippet (ThetaSnippet)
ItemID=990: 0.58587 0.00000 0.05380 0.00000 0.00000 0.00000 0.00000 0.10717 0.00000 0.00000 0.00000 0.00000 0.00000 0.01235 0.00000 0.02237 0.19027 0.02816
ItemID=991: 0.53334 0.00000 0.04208 0.00000 0.00000 0.00000 0.00000 0.11571 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.02378 0.02619 0.22449 0.03441
ItemID=992: 0.00149 0.00000 0.01222 0.04965 0.00000 0.00216 0.62157 0.00000 0.01960 0.01837 0.00788 0.00186 0.03919 0.01375 0.00000 0.00397 0.02928 0.17902
ItemID=993: 0.02599 0.03138 0.00000 0.00000 0.21381 0.00000 0.00000 0.00000 0.02462 0.00000 0.00000 0.00000 0.01710 0.00000 0.10278 0.48440 0.08263 0.01730
ItemID=994: 0.00892 0.02878 0.00000 0.01450 0.00000 0.01011 0.07798 0.17975 0.00000 0.07754 0.00000 0.00000 0.02400 0.31395 0.00000 0.06228 0.17771 0.02448
ItemID=995: 0.59693 0.00000 0.05696 0.00000 0.00000 0.00000 0.00000 0.10869 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.02886 0.17725 0.03131
ItemID=996: 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.77315 0.18808 0.01926 0.01952
ItemID=997: 0.00000 0.18094 0.00000 0.00000 0.00000 0.00000 0.00000 0.05617 0.00000 0.00000 0.00000 0.00000 0.00000 0.09448 0.00000 0.01915 0.62977 0.01949
ItemID=998: 0.00000 0.00000 0.00000 0.00000 0.22416 0.00000 0.00000 0.01115 0.31323 0.00000 0.00000 0.05260 0.00000 0.00000 0.00389 0.07594 0.31023 0.00879
ItemID=999: 0.47090 0.00000 0.04479 0.00000 0.00000 0.00000 0.00000 0.07229 0.00000 0.00000 0.00000 0.00000 0.04002 0.00000 0.00000 0.02663 0.30815 0.03722
Saving model to wiki15+3topics.new.model... OK.
Saving model in readable format to wiki15+3topics.new.phi.txt... OK.
Generating predictions...
Perplexity      = 8646.89991
SparsityPhi     = 0.82382
SparsityTheta   = 0.49320
Writing model predictions into wiki15+3topics.new.theta.txt... OK.
OK.
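
The TopTokens lines in the logs above can be turned into structured data for inspection. The following is a rough sketch, assuming the line format shown here: an optional hand-written topic label (with or without a trailing colon), then `#<n>:`, then `token(weight)` pairs. The function name `parse_top_tokens` is hypothetical and is not provided by this repository or by BigARTM.

```python
import re

# Optional label, then '#<n>:', then the rest of the line.
LINE_RE = re.compile(
    r"^\s*(?:(?P<label>[^#:]+?)\s*:?\s*)?#(?P<num>\d+):\s*(?P<rest>.*)$"
)
# One 'token(weight)' pair.
TOKEN_RE = re.compile(r"(\S+?)\(([0-9.eE+-]+)\)")


def parse_top_tokens(line):
    """Return (topic_number, label_or_None, [(token, weight), ...])."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("not a TopTokens line: %r" % line)
    tokens = [(tok, float(w)) for tok, w in TOKEN_RE.findall(m.group("rest"))]
    return int(m.group("num")), m.group("label"), tokens


# Lines abbreviated from the logs above.
print(parse_top_tokens("#18: album(0.02) song(0.015) band(0.014)"))
print(parse_top_tokens("politics: #1: party(0.013) president(0.0091)"))
```

Unlabelled lines, as in the first listing, yield `None` for the label, so the same parser covers both runs.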